Abstract We revisit the adaptive Lasso as well as the thresholded Lasso with
refitting, in a high-dimensional linear model, and study prediction error, $\ell_q$-error
($q \in \{1, 2\}$), and number of false positive selections. Our theoretical results for
the two methods are, at a rather fine scale, comparable. The differences only 
show up in terms of the (minimal) restricted and sparse eigenvalues, favoring 
thresholding over the adaptive Lasso. As regards prediction and estimation, 
the difference is virtually negligible, but our bound for the number of false 
positives is larger for the adaptive Lasso than for thresholding. Moreover, both 
these two-stage methods add value to the one-stage Lasso in the sense that, 
under appropriate restricted and sparse eigenvalue conditions, they have similar 
prediction and estimation error as the one-stage Lasso, but substantially fewer
false positives.
1 Introduction

Consider the linear model 

$$Y = X\beta + \epsilon,$$

where $\beta \in \mathbb{R}^p$ is a vector of coefficients, $X$ is an $(n \times p)$ design matrix, and $Y$
is an $n$-vector of noisy observations, $\epsilon$ being the noise term. We examine the
case $p > n$, i.e., a high-dimensional situation. The design matrix $X$ is treated
as fixed, and the Gram matrix is denoted by $\hat\Sigma := X^T X/n$. Throughout, we
assume the normalization $\hat\Sigma_{j,j} = 1$ for all $j \in \{1, \ldots, p\}$.
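As a small illustration of this normalization, the following sketch rescales the columns of a design matrix so that the diagonal of the Gram matrix equals one. The function names `normalize_columns` and `gram` are hypothetical helpers introduced here, and the sketch assumes that no column of the design is identically zero.

```python
import math

def normalize_columns(X):
    """Rescale each column X_j so that ||X_j||_2^2 / n = 1 (assumes no zero column)."""
    n, p = len(X), len(X[0])
    scales = [math.sqrt(sum(X[i][j] ** 2 for i in range(n)) / n) for j in range(p)]
    return [[X[i][j] / scales[j] for j in range(p)] for i in range(n)]

def gram(X):
    """Empirical Gram matrix X^T X / n as a list of rows."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][j] * X[i][k] for i in range(n)) / n for k in range(p)]
            for j in range(p)]
```

After normalization, the off-diagonal entries of the Gram matrix are empirical correlations between columns (not centered), which is the scale on which the eigenvalue conditions of the paper are stated.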

This paper presents a theoretical comparison between the thresholded Lasso 
with refitting and the adaptive Lasso. Both methods are very popular in prac- 
tical applications for reducing the number of active variables. 

We emphasize here and describe later that we allow for model misspecification 
where the true regression function may be non-linear in the covariates. For such 
cases, we can consider the projection onto the linear span of the covariates. 
The (projected or true) linear model does not need to be sparse nor do we 
require that the non-zero regression coefficients (from a sparse approximation)
are "sufficiently large". As for the latter, we will show in Lemma 3.3 how this
can be invoked to improve the result. Furthermore, we also do not require the
stringent irrepresentable conditions or incoherence assumptions on the design
matrix $X$ but only some weaker restricted or sparse eigenvalue conditions.

Regularized estimation with the $\ell_1$-norm penalty, also known as the Lasso
(Tibshirani [1996]), refers to the following convex optimization problem:

$$\hat\beta := \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_1 \right\}, \qquad (1)$$

where $\lambda > 0$ is a penalization parameter.

Regularization with $\ell_1$-penalization in high-dimensional scenarios has become
extremely popular. The methods are easy to use, due to recent progress in
specifically tailored convex optimization (Meier et al. [2008], Friedman et al.
[2013]).



A two-stage version of the Lasso is the so-called adaptive Lasso

$$\hat\beta_{\rm adap} := \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/n + \lambda_{\rm init} \lambda_{\rm adap} \sum_{j=1}^p \frac{|\beta_j|}{|\hat\beta_{j,{\rm init}}|} \right\}. \qquad (2)$$

Here, $\hat\beta_{\rm init}$ is the one-stage Lasso defined in (1), with initial tuning parameter
$\lambda = \lambda_{\rm init}$, and $\lambda_{\rm adap} > 0$ is the tuning parameter for the second stage. Note
that when $|\hat\beta_{j,{\rm init}}| = 0$, we exclude variable $j$ in the second stage. The adaptive
Lasso was originally proposed by Zou [2006].



Another possibility is the thresholded Lasso with refitting. Define

$$\hat S_{\rm thres} := \{ j : |\hat\beta_{j,{\rm init}}| > \lambda_{\rm thres} \}, \qquad (3)$$

which is the set of variables having estimated coefficients larger than some given
threshold $\lambda_{\rm thres}$. The refitting is then done by ordinary least squares:

$$\hat b_{\rm thres} := \arg\min_{\beta = \beta_{\hat S_{\rm thres}}} \|Y - X\beta\|_2^2/n,$$

where, for a set $S \subset \{1, \ldots, p\}$, $\beta_S$ has coefficients different from zero at the
components in $S$ only.
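The three estimators just defined can be sketched with one coordinate-descent routine for a weighted Lasso: the initial Lasso uses a constant penalty, the adaptive Lasso uses penalties proportional to the inverse initial coefficients, and the refit uses a zero penalty on the selected set. This is a minimal illustration only; the solver, the function names and the fixed number of sweeps are assumptions made here and not the algorithms cited above. The least squares refit by coordinate descent presumes linearly independent selected columns.

```python
def soft(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return max(z - t, 0.0) - max(-z - t, 0.0)

def weighted_lasso(X, Y, pen, active=None, sweeps=200):
    """Minimise ||Y - X b||_2^2 / n + sum_j pen[j] * |b_j| over b supported on `active`."""
    n, p = len(X), len(X[0])
    active = list(range(p)) if active is None else list(active)
    beta = [0.0] * p
    resid = list(Y)  # current residual Y - X beta
    for _ in range(sweeps):
        for j in active:
            col = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in col) / n
            # correlation of column j with the partial residual (variable j removed)
            rho = sum(c * r for c, r in zip(col, resid)) / n + norm * beta[j]
            new = soft(rho, pen[j] / 2.0) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new
    return beta

def lasso(X, Y, lam):
    """One-stage Lasso (1) with penalty lam * ||b||_1."""
    return weighted_lasso(X, Y, [lam] * len(X[0]))

def adaptive_lasso(X, Y, lam_init, lam_adap):
    """Adaptive Lasso (2); variables with zero initial estimate are excluded."""
    b_init = lasso(X, Y, lam_init)
    active = [j for j, b in enumerate(b_init) if b != 0.0]
    pen = [lam_init * lam_adap / abs(b) if b != 0.0 else 0.0 for b in b_init]
    return weighted_lasso(X, Y, pen, active)

def thresholded_lasso(X, Y, lam_init, lam_thres):
    """Thresholding (3) followed by a least squares refit (zero penalty) on the selected set."""
    b_init = lasso(X, Y, lam_init)
    selected = [j for j, b in enumerate(b_init) if abs(b) > lam_thres]
    return weighted_lasso(X, Y, [0.0] * len(b_init), selected)
```

For an orthogonal design with normalized columns, the initial Lasso reduces to soft-thresholding of $X^T Y/n$ at level $\lambda_{\rm init}/2$, which makes the shrinkage of large coefficients and its removal by either second stage easy to inspect by hand.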

We will present bounds for the prediction error, the $\ell_q$-error ($q \in \{1,2\}$), and
the number of false positives. The bounds for the two methods are qualitatively
the same. A difference is that our variable selection results for
the adaptive Lasso depend on its prediction error, whereas for the thresholded
Lasso, variable selection can be studied without reference to its prediction error.
In our analysis this leads to a bound for the number of false positives
of the thresholded Lasso that is smaller than the one for the adaptive Lasso,
when restricted or sparse minimal eigenvalues are small and/or sparse maximal
eigenvalues are large.

Of course, such comparisons depend on how the tuning parameters are chosen.
Choosing these by cross validation is in our view the most appropriate, but it
is beyond the scope of this paper to present a mathematically rigorous theory
for the cross validation scheme for the adaptive and/or thresholded Lasso (see
Arlot and Celisse [2010] for a recent survey on cross validation).
1.1 Related work

Consistency results for the prediction error of the Lasso can be found in Greenshtein and Ritov
[2004]. The prediction error is asymptotically oracle optimal under certain conditions
on the design matrix $X$, see e.g. Bunea et al. [2006, 2007a,b], van de Geer [2008],
Bickel et al. [2009], Koltchinskii [2009a,b], where also estimation in
terms of the $\ell_1$- or $\ell_2$-loss is considered. The "restricted eigenvalue condition"
of Bickel et al. [2009] (see also Koltchinskii [2009a,b]) plays a key role here. Restricted
eigenvalue conditions are implied by, but generally much weaker than,
"incoherence" conditions, which exclude high correlations between co-variables.
Also Candès and Plan [2009] allow for a major relaxation of incoherence conditions,
using assumptions on the set of true coefficients.

There is however a bias problem with $\ell_1$-penalization, due to the shrinking of
the estimates which correspond to true signal variables. A discussion can be
found in Zou [2006] and Meinshausen [2007]. Moreover, for consistent variable
selection with the Lasso, it is known that the so-called "neighborhood
stability condition" (Meinshausen and Bühlmann [2006]) for the design matrix,
which has been re-formulated in a nicer form as the "irrepresentable condition"
(Zhao and Yu [2006]), is sufficient and essentially necessary. Wainwright
[2007, 2009] analyzes the smallest sample size needed to recover a sparse signal
under certain incoherence conditions. Because irrepresentable or incoherence
conditions are restrictive and much stronger than restricted eigenvalue conditions
(see van de Geer and Bühlmann [2009] for a comparison), we conclude
that the Lasso for exact variable selection only works in a rather narrow range
of problems, excluding for example some cases where the design exhibits strong
(empirical) correlations.

Regularization with the $\ell_q$-"norm" with $q < 1$ will mitigate some of the bias
problems, see Zhang [2010]. Related are multi-step procedures where each of
the steps involves a convex optimization only. A prime example is the adaptive
Lasso which is a two-step algorithm and whose repeated application corresponds
in some "loose" sense to a non-convex penalization scheme (Zou and Li [2008]).
Zou [2006] analyzed the adaptive Lasso in an asymptotic setup for the case
where $p$ is fixed. Further progress in the high-dimensional scenario has been
achieved by Huang et al. [2008]. Under a rather strong mutual incoherence
condition between every pair of relevant and irrelevant covariables, they prove
that the adaptive Lasso recovers the correct model and has an oracle property.
As we will explain in Subsection 6.5, the adaptive Lasso indeed essentially needs
a (still quite restrictive) weighted version of the irrepresentable condition in
order to be able to correctly estimate the support of the coefficients.



Meinshausen and Yu [2009] examine the thresholding procedure, assuming all
non-zero components are large enough, an assumption we will avoid. Thresholding
and multistage procedures are also considered in Candès et al. [2006],
Candès et al. [2008]. In Zhou [2009, 2010], it is shown that a multi-step thresholding
procedure can accurately estimate a sparse vector $\beta \in \mathbb{R}^p$ under the
restricted eigenvalue condition of Bickel et al. [2009]. The two-stage procedure
in Zhang [2009] applies "selective penalization" in the second stage. This procedure
is studied assuming incoherence conditions. A more general framework
for multi-stage variable selection was studied by Wasserman and Roeder [2009].
Their approach controls the probability of false positives (type I error) but pays
a price in terms of false negatives (type II error). The main contribution of this
paper is that we provide bounds for the adaptive Lasso that are comparable
to the bounds for the Lasso followed by a thresholding procedure. Because the
true regression itself, or its linear projection, is perhaps not sparse, we moreover
consider a sparse approximation of the truth, somewhat in the spirit of
Zhang and Huang [2008].



1.2 Organization of the paper 

The next section introduces the sparse oracle approximation, with which we
compare the initial and adaptive Lasso. In Section 3, we present the main
results. Eigenvalues and their restricted and sparse counterparts are defined in
Section 4. Some conclusions are presented in Section 5.

The rest of the paper presents intermediate results and complements for establishing
the main results of Section 3. In Section 6, we consider the noiseless
case, i.e., the case where e = 0. The reason is that many of the theoretical issues 
involved concern the approximation properties of the two stage procedure, and 
not so much the fact that there is noise. By studying the noiseless case first, 
we separate the approximation problem from the stochastic problem. 

Both initial and adaptive Lasso are special cases of a weighted Lasso. We discuss
prediction error, $\ell_q$-error ($q \in \{1,2\}$) and variable selection with the weighted
Lasso in Subsection 6.1. Theorem 6.1 in this section is the core of the present
work, as regards prediction and estimation. Lemma 6.1 in this section is the
main result as regards variable selection. The behavior of the noiseless initial
and adaptive Lasso are simple corollaries of Theorem 6.1 and Lemma 6.1. We
give in Subsection 6.2 the resulting bounds for the initial Lasso and discuss
in Subsection 6.3 its thresholded version. In Subsection 6.4 we derive results for
the adaptive Lasso by comparing it with a thresholded initial Lasso. Moreover,
Subsection 6.5 briefly discusses the weighted irrepresentable condition, to show
that even the adaptive Lasso needs strong conditions on the design for exact
variable selection. This subsection is linked to Corollary 3.2, where it is proved
that the false positives of the adaptive Lasso vanish if the coefficients of the
oracle are sufficiently large.

Section 7 studies the noisy case. It is an easy extension of the results of Subsections
6.1, 6.2, 6.3 and 6.4. We do however need to further specify the choice of the
tuning parameters $\lambda_{\rm init}$ and $\lambda_{\rm adap}$. After explaining the notation, we present
the bounds for the prediction error, estimation error and for the number of false 
positives, of the weighted Lasso. This then provides us with the tools to prove 
the main results. 

All proofs are in Section 8. Here, we also present explicit constants in the
bounds to highlight the non-asymptotic character of the results.



2 Model misspecification, weak variables and the oracle

Let

$$\mathbb{E} Y := f^0,$$

where $f^0$ is the regression function. First, we note that without loss of generality,
we can assume that $f^0$ is linear. If $f^0$ is non-linear in the covariates, we consider
its projection $X\beta_{\rm true}$ onto the linear space $\{X\beta : \beta \in \mathbb{R}^p\}$, i.e.,

$$X\beta_{\rm true} := \arg\min_{X\beta} \|f^0 - X\beta\|_2.$$

It is not difficult to see that all our results still hold if $f^0$ is replaced by its
projection $X\beta_{\rm true}$. The statistical implication is very relevant. The mathematical
argument is the orthogonality

$$X^T (X\beta_{\rm true} - f^0) = 0.$$

For ease of notation, we therefore assume from now on that $f^0$ is indeed linear:

$$f^0 := X\beta_{\rm true}.$$

Nevertheless, $\beta_{\rm true}$ itself may not be sparse. Denote the active set of $\beta_{\rm true}$ by

$$S_{\rm true} := \{ j : \beta_{j,{\rm true}} \neq 0 \},$$

which has cardinality $s_{\rm true} := |S_{\rm true}|$. It may well be that $s_{\rm true}$ is quite large,
but that there are many weak variables, that is, many very small non-zero
coefficients in $\beta_{\rm true}$. Therefore, the sparse object we aim to recover may not be
the "true" unknown parameter $\beta_{\rm true} \in \mathbb{R}^p$ of the linear regression, but rather a
sparse approximation. We believe that an extension to the case where $f^0$ is only
"approximately" sparse better reflects the true state of nature. We emphasize
however that throughout the paper, it is allowed to replace the oracle
approximation $b^0$ given below by $\beta_{\rm true}$. This would simplify the theory. However, we
have chosen not to follow this route because it generally leads to a large price
to pay in the bounds.

The sparse approximation of $f^0$ that we consider is defined as follows. For a set
of indices $S \subset \{1, \ldots, p\}$ and for $\beta \in \mathbb{R}^p$, we let

$$\beta_{j,S} := \beta_j 1\{j \in S\}, \quad j = 1, \ldots, p.$$

Given a set $S$, the best approximation of $f^0$ using only variables in $S$ is

$$f_S = X b^S := \arg\min_{f = X\beta_S} \|f - f^0\|_2.$$

Thus, $f_S$ is the projection of $f^0$ on the linear span of the variables in $S$. Our
target is now the projection $f_{S_0}$, where

$$S_0 := \arg\min_{S \subset S_{\rm true}} \left\{ \|f_S - f^0\|_2^2/n + 7 \lambda_{\rm init}^2 |S| / \phi^2(6, S) \right\}.$$

Here, $|S|$ denotes the size of $S$. Moreover, $\phi^2(6,S)$ is a "restricted eigenvalue"
(see Section 4 for its definition), which depends on the Gram matrix $\hat\Sigma$ and
on the set $S$. The constants are chosen in relation with the oracle result (see
Corollary 8.3). In other words, $f_{S_0}$ is the optimal $\ell_0$-penalized approximation,
albeit that it is discounted by the restricted eigenvalue $\phi^2(6,S_0)$. To facilitate
the interpretation, we require $S_0$ to be a subset of $S_{\rm true}$, so that the oracle is
not allowed to trade irrelevant coefficients against restricted eigenvalues. With
$S_0 \subset S_{\rm true}$, any false positive selection with respect to $S_{\rm true}$ is also a false
positive for $S_0$.

We refer to $f_{S_0}$ as the "oracle". The set $S_0$ is called the oracle active set, and
$b^0 = b^{S_0}$ are the oracle coefficients, i.e.,

$$f_{S_0} = X b^0.$$

We write $s_0 = |S_0|$.
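To make the trade-off in the definition of the oracle concrete, consider the special case of an orthonormal design, that is, a Gram matrix equal to the identity. In that case the squared bias of dropping variable $j$ is $\beta_{j,{\rm true}}^2$, and the sketch below additionally assumes $\phi(6,S) = 1$ for every $S$, so that the criterion separates over coordinates. The function name `oracle_active_set` is hypothetical, and the sketch does not cover general designs, where the restricted eigenvalue depends on $S$.

```python
def oracle_active_set(beta_true, lam_init):
    """Oracle set S_0 for an orthonormal design, assuming phi(6, S) = 1 for all S.

    Keeping variable j costs 7 * lam_init^2 in the penalty, dropping it costs
    beta_j^2 in squared bias, so j is kept when beta_j^2 exceeds 7 * lam_init^2.
    """
    return [j for j, b in enumerate(beta_true)
            if b != 0.0 and b * b > 7.0 * lam_init ** 2]
```

In this special case the oracle simply discards the weak variables, namely those with $|\beta_{j,{\rm true}}| \le \sqrt{7}\,\lambda_{\rm init}$.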

Inferring the sparsity pattern, i.e. variable selection, refers to the task of estimating
the set of non-zero coefficients, that is, to have a limited number of false
positives (type I errors) and false negatives (type II errors). It can be verified
that under reasonable conditions with suitably chosen tuning parameter $\lambda$, the
"ideal" estimator

$$\hat\beta_{\rm ideal} := \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/n + \lambda^2 |\{ j : \beta_j \neq 0 \}| \right\}$$

has $O(\lambda^2 s_0)$ prediction error and $O(s_0)$ false positives (see for instance Barron et al.
and van de Geer [2001]). With this in mind, we generally aim at $O(s_0)$
false positives (see also Zhou [2010]), yet keeping the prediction error as small
as possible (see Corollary 3.1).

As regards false negative selections, we refer to Subsection 3.5, where we derive
bounds based on the $\ell_q$-error.



3 Main results 
3.1 Main conditions 

The behavior of the thresholded Lasso and adaptive Lasso depends on the
tuning parameters, on the design, as well as on the true $f^0$, and actually on the
interplay between these quantities. To keep the exposition clear, we will use
order symbols. Our expressions are functions of $n$, $p$, $X$, and $f^0$, and also of
the tuning parameters $\lambda_{\rm init}$, $\lambda_{\rm thres}$, and $\lambda_{\rm adap}$. For positive functions $g$ and $h$,
we say that $g = O(h)$ if $\|g/h\|_\infty$ is bounded, and $g \asymp h$ if in addition $\|h/g\|_\infty$
is bounded. Moreover, we say that $g = O_{\rm suff}(h)$ if $\|g/h\|_\infty$ is not larger than a
suitably chosen sufficiently small constant, and $g \asymp_{\rm suff} h$ if in addition $\|h/g\|_\infty$
is bounded.

Our results depend on restricted eigenvalues $\phi(L, S, N)$, minimal restricted
eigenvalues $\phi_{\min}(L, S, N)$, and minimal sparse eigenvalues $\phi_{\rm sparse}(S, N)$ (which
we generally think of as being not too small), as well as on maximal sparse
eigenvalues $\Lambda_{\rm sparse}(s)$ (which we generally think of as being not too large). The exact
definition of these constants is given in Section 4.

To simplify the expressions, we assume throughout that

$$\|f_{S_0} - f^0\|_2^2/n = O(\lambda_{\rm init}^2 s_0/\phi^2(6,S_0)) \qquad (4)$$

(where $\phi(6,S_0) = \phi(6,S_0,s_0)$), which roughly says that the oracle "squared
bias" term is not substantially larger than the oracle "variance" term. For
example, in the case of orthogonal design, this condition holds if the small non- 
zero coefficients are small enough, or if there are not too many of them, i.e., 
if 

We stress that (4) merely serves to write order bounds for the oracle, bounds with
which we compare the ones for the various Lasso versions. If actually the
"squared bias" term is the dominating term, this mathematically does not alter
the theory but makes the result more difficult to interpret.

We will furthermore discuss the results on the set

$$\mathcal{T} := \left\{ 4 \max_{1 \le j \le p} |\epsilon^T X_j| / n \le \lambda_{\rm init} \right\},$$

where $X_j$ is the $j$-th column of the matrix $X$. For an appropriate choice of $\lambda_{\rm init}$,
depending on the distribution of $\epsilon$, the set $\mathcal{T}$ has large probability. Typically,
$\lambda_{\rm init}$ can be taken of order

$$\sqrt{\log p / n}.$$

The next lemma serves as an example, but the results can clearly be extended 
to other distributions. 

Lemma 3.1 Suppose that $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Take for a given $t > 0$,

$$\lambda_{\rm init} = 4 \sigma \sqrt{\frac{2t + 2 \log p}{n}}.$$

Then

$$\mathbb{P}(\mathcal{T}) \ge 1 - 2 \exp[-t].$$
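The probability statement of Lemma 3.1 can be checked numerically. The following Monte Carlo sketch estimates the frequency of the event that $4 \max_j |\epsilon^T X_j|/n \le \lambda_{\rm init}$ for Gaussian noise. The Rademacher design (entries $\pm 1$, so that the normalization holds exactly), the function names and the default seed are assumptions of this illustration, not part of the paper.

```python
import math
import random

def lambda_init(sigma, t, n, p):
    """Tuning parameter of Lemma 3.1: 4 * sigma * sqrt((2t + 2 log p) / n)."""
    return 4.0 * sigma * math.sqrt((2.0 * t + 2.0 * math.log(p)) / n)

def frequency_of_T(n, p, sigma, t, reps, seed=0):
    """Monte Carlo frequency of the event 4 * max_j |eps^T X_j| / n <= lambda_init."""
    rng = random.Random(seed)
    # Rademacher design: every column satisfies ||X_j||_2^2 / n = 1 exactly.
    X = [[rng.choice((-1.0, 1.0)) for _ in range(p)] for _ in range(n)]
    lam = lambda_init(sigma, t, n, p)
    hits = 0
    for _ in range(reps):
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        worst = max(abs(sum(eps[i] * X[i][j] for i in range(n))) / n
                    for j in range(p))
        hits += 4.0 * worst <= lam
    return hits / reps
```

Since the lemma rests on a union bound over the $p$ columns, the observed frequency is typically well above the guaranteed level $1 - 2\exp[-t]$.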

The following conditions play an important role. Conditions A and AA for
thresholding are similar to those in Zhou [2010] (Theorems 1.2, 1.3 and 1.4).
Condition A For the thresholded Lasso, the threshold level $\lambda_{\rm thres}$ is chosen
sufficiently large, in such a way that

$$\left[ \frac{1}{\phi^2(6,S_0,2s_0)} \right] \lambda_{\rm init} = O_{\rm suff}(\lambda_{\rm thres}).$$

Condition AA For the thresholded Lasso, the threshold level $\lambda_{\rm thres}$ is chosen
sufficiently large, but such that

$$\left[ \frac{1}{\phi^2(6,S_0,2s_0)} \right] \lambda_{\rm init} \asymp_{\rm suff} \lambda_{\rm thres}.$$

Condition B For the adaptive Lasso, the tuning parameter $\lambda_{\rm adap}$ is chosen
sufficiently large, in such a way that

$$\left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\min}^3(6,S_0,2s_0)} \right] \lambda_{\rm init} = O_{\rm suff}(\lambda_{\rm adap}).$$

Condition BB For the adaptive Lasso, the tuning parameter $\lambda_{\rm adap}$ is chosen
sufficiently large, but such that

$$\left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\min}^3(6,S_0,2s_0)} \right] \lambda_{\rm init} \asymp_{\rm suff} \lambda_{\rm adap}.$$



The above conditions can be considered with a zoomed-out look, neglecting
the expressions in the square brackets ($[\cdots]$), and a zoomed-in look, taking
into account what is inside the square brackets. One may think of $\lambda_{\rm init}$ as the
noise level (see e.g. Lemma 3.1, with the $\log p$-term the price for not knowing
the relevant coefficients a priori). Zooming out, Conditions A and B say that
the threshold level $\lambda_{\rm thres}$ and the tuning parameter $\lambda_{\rm adap}$ are required to be at
least of the same order as $\lambda_{\rm init}$, i.e., they should not drop below the noise level.
Conditions AA and BB put these parameters exactly at the noise level, i.e.,
at the smallest value we allow. The reason to do this is that one then can have
good prediction and estimation bounds. If we zoom in, we see in the square
brackets the role played by the various eigenvalues. As they are defined only
later in Section 4, it is at first reading perhaps easiest to remember that the $\phi$'s
can be small and the $\Lambda$'s can be large, but one hopes they behave well, in the
sense that the values in the square brackets are not too large.



3.2 The results 



The next three theorems contain the main ingredients of the present work.
Theorem 3.1 is not new (see e.g. Bunea et al. [2006, 2007a,b], Bickel et al. [2009],
Koltchinskii [2009a,b]), albeit that we replace the perhaps non-sparse $\beta_{\rm true}$ by the
sparser $b^0$ (see also van de Geer [2008]). Recall that the latter replacement is
done because it yields generally an improvement of the bounds.
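Theorem 3.1 For the initial Lasso $\hat\beta_{\rm init} = \hat\beta$ defined in (1), we have on $\mathcal{T}$,

$$\|X\hat\beta_{\rm init} - f^0\|_2^2/n = \left[ \frac{1}{\phi^2(6,S_0)} \right] O(\lambda_{\rm init}^2 s_0),$$

and

$$\|\hat\beta_{\rm init} - b^0\|_1 = \left[ \frac{1}{\phi^2(6,S_0)} \right] O(\lambda_{\rm init} s_0),$$

and

$$\|\hat\beta_{\rm init} - b^0\|_2 = \left[ \frac{1}{\phi^2(6,S_0,2s_0)} \right] O(\lambda_{\rm init} \sqrt{s_0}).$$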

The next theorem discusses thresholding. The results correspond to those in
Zhou [2010], and will be invoked to prove similar bounds for the adaptive Lasso,
as presented in Theorem 3.3.

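Theorem 3.2 Suppose Condition A holds. Then on $\mathcal{T}$,

$$\|X\hat b_{\rm thres} - f^0\|_2^2/n = \Lambda_{\rm sparse}^2(s_0) \left[ \frac{\lambda_{\rm thres}^2}{\lambda_{\rm init}^2} \right] O(\lambda_{\rm init}^2 s_0),$$

and

$$\|\hat b_{\rm thres} - b^0\|_1 = \left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\rm sparse}(S_0,2s_0)} \right] \left[ \frac{\lambda_{\rm thres}}{\lambda_{\rm init}} \right] O(\lambda_{\rm init} s_0),$$

and

$$\|\hat b_{\rm thres} - b^0\|_2 = \left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\rm sparse}(S_0,2s_0)} \right] \left[ \frac{\lambda_{\rm thres}}{\lambda_{\rm init}} \right] O(\lambda_{\rm init} \sqrt{s_0}),$$

and

$$|\hat S_{\rm thres} \setminus S_0| = \left[ \frac{1}{\phi^4(6,S_0,2s_0)} \right] \left[ \frac{\lambda_{\rm init}^2}{\lambda_{\rm thres}^2} \right] O(s_0).$$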
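A short counting argument indicates why the number of false positives of the thresholded Lasso can be controlled through the $\ell_2$-error of the initial Lasso alone, without reference to a prediction error of the second stage. The following is only a sketch under the notation above (constants are ignored, and it is not necessarily the argument used in the proofs of Section 8). It uses that the oracle coefficients $b^0$ vanish outside $S_0$ and that every selected index satisfies $|\hat\beta_{j,{\rm init}}| > \lambda_{\rm thres}$:

```latex
\lambda_{\rm thres}^2 \, |\hat S_{\rm thres} \setminus S_0|
  \le \sum_{j \in \hat S_{\rm thres} \setminus S_0} |\hat\beta_{j,{\rm init}}|^2
  = \sum_{j \in \hat S_{\rm thres} \setminus S_0} |\hat\beta_{j,{\rm init}} - b_j^0|^2
  \le \|\hat\beta_{\rm init} - b^0\|_2^2 .
```

Dividing by $\lambda_{\rm thres}^2$ and inserting an $\ell_2$-bound for the initial Lasso of order $\lambda_{\rm init}\sqrt{s_0}$, up to restricted eigenvalues, yields a bound of order $s_0 \lambda_{\rm init}^2/\lambda_{\rm thres}^2$.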

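Theorem 3.3 Suppose Condition B holds. Then on $\mathcal{T}$,

$$\|X\hat\beta_{\rm adap} - f^0\|_2^2/n = \left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\min}(6,S_0,2s_0)} \right] \left[ \frac{\lambda_{\rm adap}}{\lambda_{\rm init}} \right] O(\lambda_{\rm init}^2 s_0),$$

and

$$\|\hat\beta_{\rm adap} - b^0\|_1 = \left[ \frac{\Lambda_{\rm sparse}^{1/2}(s_0)}{\phi_{\min}^{3/2}(6,S_0,2s_0)} \right] \sqrt{\frac{\lambda_{\rm adap}}{\lambda_{\rm init}}} \, O(\lambda_{\rm init} s_0),$$

and

$$\|\hat\beta_{\rm adap} - b^0\|_2 = \left[ \frac{\Lambda_{\rm sparse}^{1/2}(s_0) \, \phi_{\min}^{1/2}(6,S_0,2s_0)}{\phi_{\min}^2(6,S_0,3s_0)} \right] \sqrt{\frac{\lambda_{\rm adap}}{\lambda_{\rm init}}} \, O(\lambda_{\rm init} \sqrt{s_0}),$$

and

$$|\hat S_{\rm adap} \setminus S_0| = \left[ \frac{\Lambda_{\rm sparse}^2(s_0)}{\phi^4(6,S_0,2s_0)} \, \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\min}(6,S_0,2s_0)} \right] \left[ \frac{\lambda_{\rm init}}{\lambda_{\rm adap}} \right] O(s_0).$$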



We did not present a bound for the number of false positives of the initial Lasso:
it can be quite large depending on further conditions as given in Lemma 7.1.
A rough bound is presented in Lemma 3.2.

Theorems 3.2 and 3.3 show how the results depend on the choice of the tuning
parameters $\lambda_{\rm thres}$ and $\lambda_{\rm adap}$. The following corollary takes the choices of Conditions
AA and BB, as these choices give the smallest prediction and estimation
error.



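Corollary 3.1 Suppose we are on $\mathcal{T}$. Then, under Condition AA,

$$\|X\hat b_{\rm thres} - f^0\|_2^2/n = \left[ \frac{\Lambda_{\rm sparse}^2(s_0)}{\phi^4(6,S_0,2s_0)} \right] O(\lambda_{\rm init}^2 s_0), \qquad (5)$$

and

$$\|\hat b_{\rm thres} - b^0\|_1 = \left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\rm sparse}(S_0,2s_0) \, \phi^2(6,S_0,2s_0)} \right] O(\lambda_{\rm init} s_0),$$

and

$$\|\hat b_{\rm thres} - b^0\|_2 = \left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\rm sparse}(S_0,2s_0) \, \phi^2(6,S_0,2s_0)} \right] O(\lambda_{\rm init} \sqrt{s_0}),$$

and

$$|\hat S_{\rm thres} \setminus S_0| = O(s_0). \qquad (6)$$

Similarly, under Condition BB,

$$\|X\hat\beta_{\rm adap} - f^0\|_2^2/n = \left[ \frac{\Lambda_{\rm sparse}^2(s_0)}{\phi_{\min}^4(6,S_0,2s_0)} \right] O(\lambda_{\rm init}^2 s_0), \qquad (7)$$

and

$$\|\hat\beta_{\rm adap} - b^0\|_1 = \left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\min}^3(6,S_0,2s_0)} \right] O(\lambda_{\rm init} s_0),$$

and

$$\|\hat\beta_{\rm adap} - b^0\|_2 = \left[ \frac{\Lambda_{\rm sparse}(s_0)}{\phi_{\min}^2(6,S_0,3s_0) \, \phi_{\min}(6,S_0,2s_0)} \right] O(\lambda_{\rm init} \sqrt{s_0}),$$

and

$$|\hat S_{\rm adap} \setminus S_0| = \left[ \frac{\Lambda_{\rm sparse}^2(s_0) \, \phi_{\min}^2(6,S_0,2s_0)}{\phi^4(6,S_0,2s_0)} \right] O(s_0). \qquad (8)$$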

Remark 3.1 Note that our conditions on $\lambda_{\rm thres}$ and $\lambda_{\rm adap}$ depend on the $\phi$'s
and $\Lambda$'s, which are unknown. Indeed, our study is of theoretical nature, revealing
common features of thresholding and the adaptive Lasso. Furthermore, it is
possible to remove the dependence on the $\phi$'s and $\Lambda$'s, when one imposes stronger
sparse eigenvalue conditions, along the lines of Zhang and Huang [2008]. In
practice, the tuning parameters are generally chosen by cross validation.
3.3 Comparison with the Lasso



At the zoomed-out level, where all $\phi$'s and $\Lambda$'s are neglected, we see that the
thresholded Lasso (under Condition AA) and the adaptive Lasso (under Condition
BB) achieve the same order of magnitude for the prediction error as the
initial, one-stage Lasso discussed in Theorem 3.1. The same is true for their
estimation errors. Zooming in on the $\phi$'s and the $\Lambda$'s, their error bounds are
generally larger than for the initial Lasso.

For comparison in terms of false positives, we need a corresponding bound for
the initial Lasso. In the paper of Zhang and Huang [2008], one can find results
that ensure that also for the initial Lasso, modulo $\phi$'s and $\Lambda$'s, the number of
false positives is of order $s_0$. However, this result requires rather involved conditions
which also improve the bounds for the adaptive and thresholded Lasso.
We briefly address this refinement in Subsection 7.3, imposing a condition of
similar nature as the one used in Zhang and Huang [2008]. Also under these
stronger conditions, the general message remains that thresholding and the
adaptive Lasso can have similar prediction and estimation error as the initial
Lasso, and are often far better as regards variable selection.

In this section, we confine ourselves to the following lemma. Here, $\Lambda_{\max}^2$ is the
largest eigenvalue of $\hat\Sigma$, which can generally be quite large.

Lemma 3.2 On $\mathcal{T}$,

$$|\hat S_{\rm init} \setminus S_0| \le \left[ \frac{\Lambda_{\max}^2}{\phi^2(6,S_0)} \right] O(s_0).$$



3.4 Comparison between adaptive and thresholded Lasso 

When zooming-out, we see that the adaptive and thresholded Lasso have bounds 
of the same order of magnitude, for prediction, estimation and variable selection. 

At the zoomed-in level, the adaptive and thresholded Lasso also have very
similar bounds for the prediction error (compare (5) with (7)) in terms of the $\phi$'s
and $\Lambda$'s. A similar conclusion holds for their estimation error. We remark that
our choice of Conditions AA and BB for the tuning parameters is motivated
by the fact that according to our theory, these give the smallest prediction
and estimation errors. It then turns out that the "optimal" errors of the two
methods match at a quite detailed level. However, if we zoom in even further
and look at the definition of $\phi_{\rm sparse}$, $\phi$, and $\phi_{\min}$ in Section 4, it will show
up that the bounds for the adaptive Lasso prediction and estimation error are
(slightly) larger.

Regarding variable selection, at zoomed-out level the results are also comparable
(see (6) and (8)). Zooming in on the $\phi$'s and $\Lambda$'s, the adaptive Lasso may
have more false positives than the thresholded version.

A conclusion is that at the zoomed-in level, the adaptive Lasso has less favorable
bounds than the refitted thresholded Lasso. However, these are still only bounds,
which are based on focussing on a direct comparison between the two methods,
and we may have lost the finer properties of the adaptive Lasso. Indeed, the
non-explicitness of the adaptive Lasso makes its analysis a non-trivial task.
The adaptive Lasso is a quite popular practical method, and we certainly do
not advocate that it should always be replaced by thresholding and refitting.

3.5 Bounds for the number of false negatives 

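The $\ell_q$-error has immediate consequences for the number of false negatives: if
for some estimator $\hat\beta$, some target $b^0$, and some constant $\delta_q^{\rm upper}$ one has

$$\|\hat\beta - b^0\|_q \le \delta_q^{\rm upper},$$

then the number of undetected yet large coefficients cannot be very large, in
the sense that

$$|\{ j : \hat\beta_j = 0, \ |b_j^0| > \delta \}|^{1/q} \le \frac{\delta_q^{\rm upper}}{\delta}.$$

Therefore, on $\mathcal{T}$, for example

$$\left| \left\{ j : \hat\beta_{j,{\rm init}} = 0, \ \left[ \frac{1}{\phi^2(6,S_0,2s_0)} \right] \lambda_{\rm init} \sqrt{s_0} = O_{\rm suff}(|b_j^0|) \right\} \right| = 0.$$

Similar bounds hold for the thresholded and the adaptive Lasso (considering
now, in terms of the $\phi$'s and $\Lambda$'s, somewhat larger $|b_j^0|$).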
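The link between the $\ell_q$-error and false negatives is a counting argument: each index $j$ with $\hat\beta_j = 0$ and $|b_j^0| > \delta$ contributes more than $\delta^q$ to $\|\hat\beta - b^0\|_q^q$, so the number of such indices, raised to the power $1/q$, is at most $\|\hat\beta - b^0\|_q/\delta$. The sketch below checks this numerically for given vectors; the function name `false_negative_check` is a hypothetical helper introduced for illustration.

```python
def false_negative_check(beta_hat, b0, delta, q):
    """Return (count, ok): count of missed coefficients with |b0_j| > delta, and
    whether count^(1/q) <= ||beta_hat - b0||_q / delta holds."""
    missed = sum(1 for bh, b in zip(beta_hat, b0) if bh == 0.0 and abs(b) > delta)
    err_q = sum(abs(bh - b) ** q for bh, b in zip(beta_hat, b0)) ** (1.0 / q)
    return missed, missed ** (1.0 / q) <= err_q / delta
```

The inequality holds for every estimator and every target, so the check can only fail through rounding; its use lies in seeing how quickly the admissible number of false negatives shrinks as $\delta$ grows relative to the estimation error.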

One may argue that one should not aim at detecting variables that the oracle
considers as irrelevant. Nevertheless, given an estimator $\hat\beta$, it is straightforward
to bound $\|\hat\beta - \beta_{\rm true}\|_q$ in terms of $\|\hat\beta - b^0\|_q$: apply the triangle inequality

$$\|\hat\beta - \beta_{\rm true}\|_q \le \|\hat\beta - b^0\|_q + \|b^0 - \beta_{\rm true}\|_q.$$

Moreover, for $q = 2$, one has the inequality

$$\|b^0 - \beta_{\rm true}\|_2^2 \le \frac{\|f_{S_0} - f^0\|_2^2}{n \Lambda_{\min}^2(S_{\rm true})},$$

where $\Lambda_{\min}^2(S)$ is the smallest eigenvalue of the Gram matrix corresponding to
the variables in $S$. One may verify that $\phi(6, S_{\rm true}) \le \Lambda_{\min}(S_{\rm true})$. In other
words, choosing $\beta_{\rm true}$ as target instead of $b^0$ does in our approach not lead
to an improvement in the bounds for $\|\hat\beta - \beta_{\rm true}\|_2$.



3.6 Having large coefficients 



Let us have a closer look at what conditions on the size of the coefficients can
bring us. We only discuss the adaptive Lasso (thresholding again giving similar
results, see also Zhou [2010]).
We define

$$|b^0|_{\min} := \min_{j \in S_0} |b_j^0|.$$

Moreover, we let

$$|b^0|_{\rm harm}^2 := \left( \frac{1}{s_0} \sum_{j \in S_0} \frac{1}{|b_j^0|^2} \right)^{-1}$$

be the harmonic mean of the squared coefficients.

Condition C For the adaptive Lasso, take $\lambda_{\rm adap}$ sufficiently large, such that

$$|b^0|_{\rm harm} = O_{\rm suff}(\lambda_{\rm adap}).$$

Condition CC For the adaptive Lasso, take $\lambda_{\rm adap}$ sufficiently large, but such
that

$$|b^0|_{\rm harm} \asymp_{\rm suff} \lambda_{\rm adap}.$$



Lemma 3.3 Suppose that for some constant 5^^ , on T, 



Assume in addition that 
Then under Condition C, 



3init-6l|oo<CP''". 



|X^fdap-f°lli/n 



1 



and 



and 



and 



||4dap-^'°| 



||/3adap - 6° 



|5'adap\5'o| 



02(6, So) 
1 



O(^hiif50), 



A 



<A2(6,5o) 
1 

i)2(6,5o,2so) 



^sparse ( ■50 ) 



\2 

adap 

harm 

O(AinitSo), 

0(Ainit\/so), 



adap 



harm 
Aadap 



|&°|harm 

o 



>2(6,5o)</<4(6,5o,2so) 
It is clear that by Theorem 13.11 



3init 



A 



1 



harm / 



0(Amit\/so)- 



02(2, 5o) 02(2,5o,2so). 

This can be improved under coherence conditions on the Gram matrix. To
simplify the exposition, we will not discuss such improvements in detail (see
Lounici [2008]).



Under Condition CC, the bound for the prediction error and estimation error 
is again the smallest. We moreover have the following corollary for the number 
of false positives. 
Corollary 3.2 Assume the conditions of Lemma 3.3 and

(/)^(6,S'o,2so)AinitV^= 0(|6°|harm)- 

Then on T, 

■^sparse ('^o) 



|»S'adap\»S'o 



0(1). 



By assuming that $|b^0|_{\rm harm}$ is sufficiently large, that is,



-^sparse ( So) 



</.(6,5o)02(6,5o,2so) 



Ainit\/so — Osuff(|^*^|harm 



one can bring $|\hat S_{\rm adap} \setminus S_0|$ down to zero, i.e., no false positives. One may verify
that this boils down to a situation where the weighted irrepresentable condition
holds: see Example 6.1 in Subsection 6.5.



As discussed in Section 3.5, large non-zero coefficients also lead to a small
number of, or eventually zero, false negative selections. Therefore, the adaptive
and thresholded Lasso recover the support $S_0$ of the oracle if all of its non-zero
coefficients are sufficiently large (in absolute value), assuming much weaker
conditions on the design than the (unweighted) irrepresentable condition, which
is necessary for the Lasso.



4 Notation and definition of generalized eigenvalues 

We reformulate the problem in $L_2(Q)$, where $Q$ is a generic probability measure
on some space $\mathcal{X}$. (This is somewhat more natural in the noiseless case, which
we will consider in Section 6.) Let $\{\psi_j\}_{j=1}^p \subset L_2(Q)$ be a given dictionary. For
$j = 1, \ldots, p$, the function $\psi_j$ will play the role of the $j$-th co-variable. The
Gram matrix is

$$\Sigma_{j,k} := \int \psi_j \psi_k \, dQ, \quad j, k = 1, \ldots, p.$$

We assume that $\Sigma$ is normalized, i.e., that $\int \psi_j^2 \, dQ = 1$ for all $j$. In our final
results, we will actually take $\Sigma = \hat\Sigma$, the (empirical) Gram matrix corresponding
to fixed design.

Write a linear function of the $\psi_j$ with coefficients $\beta \in \mathbb{R}^p$ as

$$f_\beta := \sum_{j=1}^p \psi_j \beta_j.$$

The $L_2(Q)$-norm is denoted by $\| \cdot \|$, so that

$$\|f_\beta\|^2 = \beta^T \Sigma \beta.$$

Recall that for an arbitrary $\beta \in \mathbb{R}^p$, and an arbitrary index set $S$, we use the
notation

$$\beta_{j,S} := \beta_j 1\{j \in S\}.$$

We now present our notation for eigenvalues. We also introduce restricted
eigenvalues and sparse eigenvalues.



4.1 Eigenvalues 

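The largest eigenvalue of $\Sigma$ is denoted by $\Lambda_{\max}^2$, i.e.

$$\Lambda_{\max}^2 := \max_{\|\beta\|_2 = 1} \beta^T \Sigma \beta.$$

We will also need the largest eigenvalue of a submatrix containing the inner
products of variables in $S$:

$$\Lambda_{\max}^2(S) := \max_{\|\beta_S\|_2 = 1} \beta_S^T \Sigma \beta_S.$$

Its minimal eigenvalue is

$$\Lambda_{\min}^2(S) := \min_{\|\beta_S\|_2 = 1} \beta_S^T \Sigma \beta_S.$$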



4.2 Restricted eigenvalues 



A restricted eigenvalue is of similar nature as the minimal eigenvalue of $\Sigma$,
but with the coefficients $\beta$ restricted to certain subsets of $\mathbb{R}^p$. The restricted
eigenvalue condition we impose corresponds to the so-called adaptive version as
introduced in van de Geer and Bühlmann [2009]. It differs from the restricted
eigenvalue condition in Bickel et al. [2009] or Koltchinskii [2009a,b]. This is due
to the fact that we want to mimic the oracle $f_{S_0}$, that is, do not choose $f^0$ as
target, so that we have to deal with a bias term $\|f_{S_0} - f^0\|$. For a given $S$, our
restricted eigenvalue condition is stronger than the one in Bickel et al. [2009]
or Koltchinskii [2009a,b]. On the other hand, we apply it to the smaller set $S_0$
instead of to $S_{\rm true}$.
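Define for an index set $S \subset \{1, \ldots, p\}$ and constant $L > 0$,
the sets of restrictions

$$\mathcal{R}(L, S) := \left\{ \beta : \|\beta_{S^c}\|_1 \le L \sqrt{|S|} \, \|\beta_S\|_2 \right\},$$

and for a set $\mathcal{N} \supset S$ and constant $L > 0$,

$$\mathcal{R}(L, S, \mathcal{N}) := \left\{ \beta \in \mathcal{R}(L, S) : \max_{j \notin \mathcal{N}} |\beta_j| \le \min_{j \in \mathcal{N} \setminus S} |\beta_j| \right\}.$$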



Definition: Restricted eigenvalue. For $N \ge |S|$, we call

$$\phi(L, S, N) := \min \left\{ \frac{\|f_\beta\|}{\|\beta_{\mathcal{N}}\|_2} : \mathcal{N} \supset S, \ |\mathcal{N}| \le N, \ \beta \in \mathcal{R}(L, S, \mathcal{N}) \right\}$$

the $(L, S, N)$-restricted eigenvalue. The $(L, S, N)$-restricted eigenvalue
condition holds if $\phi(L, S, N) > 0$.

For the case $N = |S|$, we write $\phi(L, S) := \phi(L, S, |S|)$.
The minimal $(L, S, N)$-restricted eigenvalue is

$$\phi^2_{\min}(L, S, N) := \min_{\mathcal{N} \supset S, \ |\mathcal{N}| = N} \phi^2(L, \mathcal{N}).$$

It is easy to see that $\phi_{\min}(L, S, N) \le \phi(L, S, N) \le \phi(L, S) \le \Lambda_{\min}(S)$ for all
$L > 0$. It can moreover be shown that

<P\L,S,2\S\)>mm\\\f^f : M D S, \M\ = 2\S\, m42<l, WMU = I 



4.3 Sparse eigenvalues 

The fact that we also need sparse eigenvalues is in line with the sparse Riesz
condition occurring in Zhang and Huang [2008].

Definition: Sparse eigenvalues. For $N \in \{1, \ldots, p\}$, the maximal sparse
eigenvalue is

$$\Lambda_{\rm sparse}(N) := \max_{\mathcal{N} : \ |\mathcal{N}| = N} \Lambda_{\max}(\mathcal{N}).$$

For an index set $S \subset \{1, \ldots, p\}$ with $|S| \le N$, the minimal sparse eigenvalue is

$$\phi_{\rm sparse}(S, N) := \min_{\mathcal{N} \supset S : \ |\mathcal{N}| = N} \Lambda_{\min}(\mathcal{N}).$$

One easily verifies that for any set $\mathcal{N}$ with $|\mathcal{N}| = ks$, $k \in \mathbb{N}$,

$$\Lambda_{\max}(\mathcal{N}) \le \sqrt{k} \, \Lambda_{\rm sparse}(s).$$

Moreover, for all $L \ge 0$,

$$\phi_{\rm sparse}(S, N) = \phi(0, S, N) \ge \phi(L, S, N).$$

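As a purely illustrative aid (not part of the original analysis), the sparse eigenvalues of Subsection 4.3 can be computed by brute force for a small Gram matrix. The following Python sketch assumes NumPy; the function names, the random design and the dimensions are hypothetical choices made only for this example. Since $\Lambda^2$ denotes an eigenvalue, the functions return square roots. The enumeration over all index sets is exponential in $p$ and is meant for toy problems only.

```python
import itertools
import numpy as np

def lam_max(Sigma, idx):
    """Lambda_max(idx): square root of the largest eigenvalue of the submatrix."""
    idx = list(idx)
    return np.sqrt(np.linalg.eigvalsh(Sigma[np.ix_(idx, idx)])[-1])

def lam_min(Sigma, idx):
    """Lambda_min(idx): square root of the smallest eigenvalue of the submatrix."""
    idx = list(idx)
    smallest = np.linalg.eigvalsh(Sigma[np.ix_(idx, idx)])[0]
    return np.sqrt(max(smallest, 0.0))  # guard against tiny negative round-off

def sparse_max(Sigma, N):
    """Maximal sparse eigenvalue: max of Lambda_max over all index sets of size N."""
    p = Sigma.shape[0]
    return max(lam_max(Sigma, c) for c in itertools.combinations(range(p), N))

def sparse_min(Sigma, S, N):
    """Minimal sparse eigenvalue: min of Lambda_min over supersets of S of size N."""
    p = Sigma.shape[0]
    rest = [j for j in range(p) if j not in S]
    return min(lam_min(Sigma, list(S) + list(c))
               for c in itertools.combinations(rest, N - len(S)))

# Hypothetical toy design with normalized columns, so that diag(Sigma) = 1.
rng = np.random.default_rng(0)
n, p = 30, 8
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))
Sigma = X.T @ X / n

S = [0, 1]
s, k = len(S), 2
big_set = [0, 1, 2, 3]                      # a set of size k * s
lhs = lam_max(Sigma, big_set)
rhs = np.sqrt(k) * sparse_max(Sigma, s)     # bound sqrt(k) * Lambda_sparse(s)
phi_sparse = sparse_min(Sigma, S, 4)
```

In this sketch `lhs` never exceeds `rhs`, and `phi_sparse` never exceeds `lam_min(Sigma, S)`, in line with the two inequalities stated above.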
5 Conclusions 

We have presented comparable bounds for the adaptive Lasso and the thresholded
Lasso with refitting, and we have also compared them to the ordinary Lasso. The
framework of our analysis allows for misspecified linear models whose best linear 
projection is not necessarily sparse and with possibly small non-zero regression 
coefficients, i.e., many weak variables. This setting is much more realistic than 
the usual high-dimensional framework where the model is true with only a few 
but strong variables. 

Estimating the support $S_0$ of the non-zero coefficients is a hard statistical
problem. The irrepresentable condition, which is essentially a necessary condition
for exact recovery of the non-zero coefficients by the one-step Lasso, is much
too restrictive in many cases. In this paper, our main focus is on having $O(s_0)$
false positives while achieving good prediction and estimation. This is inspired
by the behavior of the "ideal" $\ell_0$-penalized estimator.

We have examined thresholding the Lasso with least squares refitting and the
adaptive Lasso. Our main conclusion is that both methods can have about the
same prediction and estimation error as the one-stage ordinary Lasso, and that
both gain over the one-stage Lasso in the sense of having fewer false positives.
Moreover, according to our theory (and not exploiting the fact that the adaptive 
Lasso mimics thresholding and refitting using an "oracle" threshold), thresh- 
olding with least squares refitting and the adaptive Lasso perform equally well, 
even when considered at a rather fine scale. Our bounds for the adaptive Lasso 
are more sensitive to small (minimal) restricted eigenvalues or small minimal 
sparse eigenvalues, or large sparse maximal eigenvalues. Both thresholded and 
adaptive Lasso benefit from a situation with large non-zero coefficients of the 
oracle. 

We do not give an account of the tightness of our bounds. The thresholded 
Lasso allows a rather direct analysis, and we believe there is little room for 
improvement of the bounds for this method. The analysis of the adaptive
Lasso is more involved. Our comparison to thresholding might not do justice
to the adaptive Lasso. Indeed, we have not fully exploited the finer oracle 
properties of the adaptive Lasso. 

In practice, the tuning parameters are often chosen by cross validation,
which may correspond to a choice giving the smallest prediction error. It is
not within the scope of this paper to prove that with cross validation,
thresholding and the adaptive Lasso again have comparable theoretical performance,
although we do believe this to be the case. As for the computational aspect,
we observe the following. For the solution path over all $\lambda_{\rm adap}$, the adaptive
Lasso needs $O(n |S_{\rm init}| \min(n, |S_{\rm init}|))$ essential operation counts. The same
order of operation counts is needed when computing the thresholded Lasso for
the whole solution path over all $\lambda_{\rm thres}$. Therefore, the two methods are also
computationally comparable.



6 The noiseless case 

Consider a fixed target $f^0 = f_{\beta_{\rm true}} \in L_2(Q)$. Let $S \subset \{1, \ldots, p\}$ and let
$f_S := \arg\min_{f = f_{\beta_S}} \|f_{\beta_S} - f^0\|$ be the projection of $f^0$ on the $|S|$-dimensional linear
space spanned by the variables $\{\psi_j\}_{j \in S}$. We denote the coefficients of $f_S$ by $b^S$,
i.e.,

$$f_S = \sum_{j \in S} \psi_j b^S_j = f_{b^S}.$$

The oracle set is defined by trading off dimension against fit, namely

$$S_0 := \arg\min_{S \subset S_{\rm true}} \left\{ \|f_S - f^0\|^2 + \frac{3 \lambda^2_{\rm init} |S|}{\phi^2(2, S)} \right\}, \qquad (9)$$

where the constants are now from Theorem 6.1 (or its Corollary 8.1). We call
$f_{S_0}$ the oracle, and we let $b^0 := b^{S_0}$, i.e., $f_{S_0} = f_{b^0}$.

For simplicity, we assume throughout that

$$\|f_{S_0} - f^0\|^2 = O\left( \lambda^2_{\rm init} s_0 / \phi^2(2, S_0) \right),$$

which roughly says that the approximation error does not overrule the penalty
term.

The initial Lasso is

$$\beta_{\rm init} := \arg\min_\beta \left\{ \|f_\beta - f^0\|^2 + \lambda_{\rm init} \|\beta\|_1 \right\}.$$

We assume that the tuning parameter $\lambda_{\rm init}$ is set at some fixed value. Of course,
in the noiseless case, the optimal - in terms of prediction error - value for $\lambda_{\rm init}$
is $\lambda_{\rm init} = 0$. However, in the noisy case, a strictly positive lower bound for $\lambda_{\rm init}$
is dictated by the noise level. Write

$$f_{\rm init} := f_{\beta_{\rm init}}, \quad S_{\rm init} := \{j : \beta_{j,{\rm init}} \ne 0\}, \quad \delta_{\rm init} := \|f_{\rm init} - f^0\|. \qquad (10)$$



Let for $\delta > 0$,

$$S^\delta_{\rm init} := \{j : |\beta_{j,{\rm init}}| > \delta\}.$$

Then $f_{S^\delta_{\rm init}} = f_{b^{S^\delta_{\rm init}}}$ is the refitted Lasso after thresholding at $\delta$. Note that we
express explicitly the dependence of the thresholded estimator on the threshold
level, which we now call $\delta$ (instead of $\lambda_{\rm thres}$ as we did in the introduction).
The reason for this is that the analysis of the adaptive Lasso will go via the
thresholded Lasso with a choice of the threshold $\delta$ that trades off prediction
error against estimation error (see (18) in the proof of Theorem 6.4).



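The adaptive Lasso is

$$\beta_{\rm adap} := \arg\min_\beta \left\{ \|f_\beta - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm adap} \sum_{j=1}^p \frac{|\beta_j|}{|\beta_{j,{\rm init}}|} \right\}.$$

The second stage tuning parameter $\lambda_{\rm adap}$ is again assumed to be strictly
positive. We denote the resulting adaptive variants of (10) by

$$f_{\rm adap} := f_{\beta_{\rm adap}}, \quad S_{\rm adap} := \{j : \beta_{j,{\rm adap}} \ne 0\}, \quad \delta_{\rm adap} := \|f_{\rm adap} - f^0\|.$$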



As the initial and adaptive Lasso are special cases of the weighted Lasso, many
of the results in Subsections 6.2, 6.3 and 6.4 are consequences of those for the
weighted Lasso as studied in Subsection 6.1. The weighted Lasso is

$$\beta_{\rm weight} := \arg\min_\beta \left\{ \|f_\beta - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} \sum_{j=1}^p w_j |\beta_j| \right\},$$

where the $\{w_j\}_{j=1}^p$ are non-negative weights.

We set $f_{\rm weight} := f_{\beta_{\rm weight}}$, $S_{\rm weight} := \{j : \beta_{j,{\rm weight}} \ne 0\}$. Moreover, we define

$$\|w_S\|_2^2 := \sum_{j \in S} w_j^2, \qquad w^{\min}_{S^c} := \min_{j \notin S} w_j.$$

By the reparametrization $\beta \mapsto \gamma := W\beta$, where $W = {\rm diag}(w_1, \ldots, w_p)$, one sees
that the weighted Lasso is a standard Lasso with Gram matrix

$$\Sigma_{\rm weight} := W^{-1} \Sigma W^{-1}.$$

We emphasize however that $\Sigma_{\rm weight}$ is generally not normalized, i.e., generally
${\rm diag}(\Sigma_{\rm weight}) \ne I$.
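The reparametrization can be checked numerically on a small problem. The sketch below is only an illustration under stated assumptions: it uses NumPy, strictly positive weights (so that $W$ is invertible), a hypothetical random design, and a plain coordinate descent solver written for this example (`weighted_lasso` is not a routine from the paper). The solver minimizes $\beta^T \Sigma \beta - 2 b^T \beta + \lambda \sum_j w_j |\beta_j|$, which equals the weighted Lasso criterion up to a constant when $b = \Sigma \beta_{\rm true}$ and $\lambda = \lambda_{\rm init} \lambda_{\rm weight}$.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso(Sigma, b, lam, w, n_sweeps=5000, tol=1e-13):
    """Coordinate descent for beta' Sigma beta - 2 b' beta + lam * sum_j w_j |beta_j|."""
    p = len(b)
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            # Linear coefficient of beta_j when all other coordinates are fixed.
            c_j = Sigma[j] @ beta - Sigma[j, j] * beta[j] - b[j]
            beta[j] = soft_threshold(-c_j, lam * w[j] / 2.0) / Sigma[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Hypothetical toy problem (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 40, 6
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))        # normalize: diag(Sigma) = 1
Sigma = X.T @ X / n
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0])
b = Sigma @ beta_true
w = rng.uniform(0.5, 2.0, size=p)          # strictly positive weights
lam = 0.3                                  # plays the role of lambda_init * lambda_weight

# Weighted Lasso in the original parametrization.
beta_w = weighted_lasso(Sigma, b, lam, w)

# Standard Lasso (unit weights) with Gram matrix W^{-1} Sigma W^{-1}.
W_inv = np.diag(1.0 / w)
Sigma_weight = W_inv @ Sigma @ W_inv
gamma = weighted_lasso(Sigma_weight, W_inv @ b, lam, np.ones(p))
```

Under these assumptions `gamma` agrees with `w * beta_w` up to numerical tolerance, and `np.diag(Sigma_weight)` equals `1 / w**2`, which differs from one unless all weights are one.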

6.1 The weighted Lasso 

We first present a bound for the prediction and estimation error and then 
consider variable selection. 

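Theorem 6.1 Let $S$ be an index set with cardinality $s := |S|$, satisfying for
some constants $M > 0$ and $L > 0$,

$$w^{\min}_{S^c} \ge M/L, \qquad \|w_S\|_2 / \sqrt{s} \le M.$$

Then for all $\beta$, we have

$$\|f_{\rm weight} - f^0\|^2 \le 2 \|f_{\beta_S} - f^0\|^2 + \frac{6 \lambda^2_{\rm init} \lambda^2_{\rm weight} M^2 s}{\phi^2(2L, S)}.$$

Moreover, for all $\beta$, we have

$$\sqrt{s} \, \|(\beta_{\rm weight})_S - \beta_S\|_2 + \|(\beta_{\rm weight})_{S^c}\|_1 / L \le \frac{3 \|f_{\beta_S} - f^0\|^2}{\lambda_{\rm init} \lambda_{\rm weight} M} + \frac{3 \lambda_{\rm init} \lambda_{\rm weight} M s}{\phi^2(2L, S)}.$$

Finally, it holds for all $\beta$, that

$$\|\beta_{\rm weight} - \beta_S\|_2 \le \frac{6L \|f_{\beta_S} - f^0\|^2}{\lambda_{\rm init} \lambda_{\rm weight} M \sqrt{s_0}} + \frac{6L \lambda_{\rm init} \lambda_{\rm weight} M (s + s_0)}{\phi^2(2L, S, s + s_0) \sqrt{s_0}}.$$

We will apply the above theorem with $S$ the set of the smaller weights.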
Corollary 6.1 Fix some arbitrary $\delta > 0$, and let

$$S^\delta_{\rm weight} := \{j : w_j < 1/\delta\}, \qquad (S^\delta_{\rm weight})^c := \{j : w_j \ge 1/\delta\}.$$

The indices $j$ with $w_j = 1/\delta$ can be put in either $S^\delta_{\rm weight}$ or in its complement.
Suppose that for some $a \ge 0$,

$$|S^\delta_{\rm weight} \setminus S_0| \le a s_0.$$

Taking $S = S^\delta_{\rm weight}$, $L = 1$ and $M = 1/\delta$ in Theorem 6.1, we get that for all $\beta$,



0||2 ^ oil f f0||2 I init Veight 



ii/weight-fir <2|i/;3,. -nr + 

weight - ~f jjjjj^ 



6Afnit^Lght(l + «)go 
<52<A^.j2,So,(l + a>o) 
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Moreover, 



3<5||/« 



f0||2 



^weight f^s^ 



< 



and 



weight 



^weight f^s^ 1 1 2 

weight 



ht(l + a)so 

AinitAweight ' '^</'^(2,S'o,(l + a)so) ' 



weight 



+ 



f0||2 



< 



6 AinitAweight (2 + a)-y/so 

So AinitAweight ' '5'Amin(2) ^0, (2 + a)so) 



weight 



+ 



In the case $a = 0$, one may replace in the last bound, $\phi_{\min}(2, S_0, (2 + a) s_0) =
\phi_{\min}(2, S_0, 2 s_0)$ by $\phi(2, S_0, 2 s_0)$.

Our next theme is variable selection. The Karush-Kuhn-Tucker (KKT) conditions
(see Bertsimas and Tsitsiklis [1997]) can be invoked to derive Lemma 6.1
below, where we use the notation

$$\|(1/w)_S\|_2^2 := \sum_{j \in S} \frac{1}{w_j^2}.$$



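Lemma 6.1 It holds that

$$|S_{\rm weight} \setminus S_0|^2 \le 4 \Lambda^2_{\max}(S_{\rm weight} \setminus S_0) \, \frac{\|f_{\rm weight} - f^0\|^2}{\lambda^2_{\rm weight} \lambda^2_{\rm init}} \, \|(1/w)_{S_{\rm weight} \setminus S_0}\|_2^2. \qquad (11)$$

If $|S_{\rm weight} \setminus S_0| > s_0$, we have

$$|S_{\rm weight} \setminus S_0| \le 8 \Lambda^2_{\rm sparse}(s_0) \, \frac{\|f_{\rm weight} - f^0\|^2}{\lambda^2_{\rm weight} \lambda^2_{\rm init}} \, \frac{\|(1/w)_{S_{\rm weight} \setminus S_0}\|_2^2}{s_0}.$$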



6.2 The initial Lasso 



Recall that

$$\delta_{\rm init} := \|f_{\rm init} - f^0\|.$$

For $q \ge 1$, we define

$$\delta_q := \|\beta_{\rm init} - b^0\|_q.$$



Theorem 6.2 The prediction error of the initial Lasso has

$$\delta^2_{\rm init} = \frac{1}{\phi^2(2, S_0)} \, O(\lambda^2_{\rm init} s_0),$$

and its estimation error has

$$\delta_1 = \frac{1}{\phi^2(2, S_0)} \, O(\lambda_{\rm init} s_0), \qquad \delta_2 = \frac{1}{\phi^2(2, S_0, 2 s_0)} \, O(\lambda_{\rm init} \sqrt{s_0}).$$

The initial estimator has number of false positives

$$|S_{\rm init} \setminus S_0| \le \frac{\Lambda^2_{\max}(S_{\rm init} \setminus S_0)}{\phi^2(2, S_0)} \, O(s_0).$$

Considering the variable selection result, it is clear that $\Lambda^2_{\max}(S_{\rm init} \setminus S_0) \le \Lambda^2_{\max}$.
Without further conditions, this cannot be refined, and the eigenvalue $\Lambda^2_{\max}$
can be quite large (yet having the minimal eigenvalue of $\Sigma$ bounded away from
zero). Therefore, the result of Theorem 6.2 needs further conditions for good
variable selection properties of the initial Lasso.



6.3 Thresholding the initial estimator 

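Variable selection results by thresholding are not difficult to obtain:

$$|S^\delta_{\rm init} \setminus S_0| \le \frac{\delta_q^q}{\delta^q}.$$

Hence, for $\delta \ge \delta_1 / s_0 \wedge \delta_2 / \sqrt{s_0}$, we get for $q \in \{1, 2\}$,

$$|S^\delta_{\rm init} \setminus S_0| \le s_0. \qquad (12)$$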
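The counting argument behind (12) is elementary: every false positive $j \in S^\delta_{\rm init} \setminus S_0$ has $b^0_j = 0$ and $|\beta_{j,{\rm init}}| > \delta$, so it contributes more than $\delta^q$ to $\delta_q^q$. The short Python sketch below illustrates this on synthetic numbers. It assumes NumPy; the vectors `b0` and `beta_init` are hypothetical stand-ins (no Lasso is actually solved here), and $S_0$ is taken to be the support of `b0`.

```python
import numpy as np

def false_positives_after_threshold(beta_init, b0, delta):
    """Number of indices with |beta_init_j| > delta that lie outside supp(b0)."""
    in_S0 = b0 != 0
    selected = np.abs(beta_init) > delta
    return int(np.sum(selected & ~in_S0))

# Hypothetical oracle coefficients and a perturbed "initial estimate".
rng = np.random.default_rng(1)
p, s0 = 200, 5
b0 = np.zeros(p)
b0[:s0] = 3.0
beta_init = b0 + 0.1 * rng.standard_normal(p)

delta_1 = np.sum(np.abs(beta_init - b0))      # l1-error
delta_2 = np.linalg.norm(beta_init - b0)      # l2-error

# Threshold at the smaller of the two levels, as in the condition preceding (12).
delta = min(delta_1 / s0, delta_2 / np.sqrt(s0))
fp = false_positives_after_threshold(beta_init, b0, delta)

# Generic bounds |S_init^delta \ S_0| <= delta_q^q / delta^q at an arbitrary level.
level = 0.15
fp_level = false_positives_after_threshold(beta_init, b0, level)
bound_q1 = delta_1 / level
bound_q2 = delta_2 ** 2 / level ** 2
```

Here `fp` is at most `s0`, and `fp_level` is bounded by both `bound_q1` and `bound_q2`, whatever the random draw.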



If the coefficients of the oracle are sufficiently large, thresholding will improve 
the prediction and estimation error. Here, we do not impose such minimal size 
conditions. The estimation error of the thresholded Lasso is then still easy to 
assess. Our bound for the prediction error, however, now depends on maximal 
sparse eigenvalues. 

At this stage, we invoke the noiseless counterparts of Conditions A and AA.

Condition a We have $\lambda_{\rm init} / \phi^2(2, S_0) = O_{\rm suff}(\delta)$.

Condition aa We have $\lambda_{\rm init} / \phi^2(2, S_0, 2 s_0) \asymp_{\rm suff} \delta$.

Theorem 6.3 Assume Condition a. Then

52 



If 



'^init 



p0||2 



A: 



sparse 



{so) 



\2 

^init 



and 



I'S'initV'S'o 



A, 



sparse 



{so) 



sparse 

1 



\init 



O(AfnitSo), 
0(Ainit\/so)' 



X2 

^init 
(52 



O{so) 



b\2,So,2so)_ 

The expressions for the prediction and estimation error lead to favoring the
choice $\lambda_{\rm init} / \phi^2(2, S_0, 2 s_0) \asymp_{\rm suff} \delta$ of Condition aa, which yields



f0||2 



A2 

^sparse 



i)4(2,5o,2so) 



O(AfnifSo), 



A, 



sparse 



{so) 



and 



■/^sparse (5o,2so)</.2(2,5o,2so) 
|5'init\'S'o| = 0(so)- 



0(Ainit\/io) 
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6.4 The adaptive Lasso 



Observe that the adaptive Lasso is somewhat more reluctant than thresholding
and refitting: the latter ruthlessly disregards all coefficients with $|\beta_{j,{\rm init}}| \le \delta$
(i.e., these coefficients get penalty $\infty$), and puts zero penalty on coefficients
with $|\beta_{j,{\rm init}}| > \delta$. The adaptive Lasso gives the coefficients with $|\beta_{j,{\rm init}}| \le \delta$
a penalty of at least $\lambda_{\rm init} (\lambda_{\rm adap} / \delta)$ and those with $|\beta_{j,{\rm init}}| > \delta$ a penalty of at
most $\lambda_{\rm init} (\lambda_{\rm adap} / \delta)$. (Looking ahead, we will actually need to choose $\lambda_{\rm adap}$
sufficiently large in the noisy case, see Theorem 3.3.)

Recall

$$\delta_{\rm adap} := \|f_{\rm adap} - f^0\|.$$



The noiseless versions of Conditions B and BB are: 
Condition b We have 

'min 

(2, So, 2so)Asparse(so) 



Ai- 



^>4(2,5o,2so) 



Osuflf (Aadap)- 



Condition bb We have 



1 

(2, ^0, 2so)Asparse(so) 

(^4(2,5o,2so) 



^suff Aadap ■ 



Note the slight discrepancy with the noisy versions: the noiseless versions are 
somewhat better. This is due to the fact that we also will need to choose Aadap 
large enough to handle the noise. 



Theorem 6.4 Assume Condition b. Then



and 



and 



•^adap 



3adap — ^°l|l 



A 



sparse 



(«o) 



1(2,50,250) 



Aadap 



^init 



O(-^?mf50), 



A 



1/2 
sparse 



{so) 



mm \ 



Aadap 



-'adap 



6° I 



(2,5o,2so) 

Al^arse(so)0m^n(2, Sq, 2sq) 



O(AmitSo), 



^init 



62 
mm 



i„(2,5o,3so) 



'A 



and 



I'S'adapV'S'o 



■^sparse ( '5 0) 

^^(2,5o,2so) 



A 



sparse 



'so) 



adap 
Ainit 

Ainit 



0(Ainit\/io), 



A 



adap 



O(5o). 



lin (2,S'o,2so). 

Considering the bounds for the prediction and estimation error leads to favoring
the choice of Condition bb, giving



e2 

''adap 



AUe(^o) 

64(2,5o,2so) 



O(AfnitSo), 
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||/3adap - 6° 
||/3adap-6°| 



and 



A, 



sparse 



(so) 



0min(2, 5*0, 2so)(j)'^{2, So,2so) 



A, 



sparse 



iso)4>mm{2,So,2so) 



|'S'adap\'S'( 



<iJ2, 50,350)02(2,50,250) 

■^sparse('^o) 



cPl.j2,So,2so) 



0{so) 



O(AinitSo), 



6.5 The weighted irrepresentable condition 

This subsection will show that, even in the noiseless case, exact variable
selection needs rather strong conditions. It serves as a motivation for the perhaps
more moderate aim of having $O(s_0)$ ($\le O(s_{\rm true})$) false positives and detecting
only the larger coefficients. Moreover, we illustrate in Example 6.1 of this
subsection that the lower bound on the non-zero coefficients as given in Corollary
is tight. 



It is known that the initial Lasso essentially needs the irrepresentable condition
in order to have no false positives (Zhao and Yu [2006]). Similar statements
can be made for the weighted Lasso.

For a $(p \times p)$-matrix $\Sigma = (\sigma_{j,k})$, we define

$$\Sigma_{1,1}(S) := (\sigma_{j,k})_{j,k \in S}, \qquad \Sigma_{2,1}(S) := (\sigma_{j,k})_{j \notin S, \ k \in S}.$$

We let $W_S := {\rm diag}(\{w_j\}_{j \in S})$.

Definition We say that the weighted irrepresentable condition holds for $S$ if
for all vectors $\tau_S \in \mathbb{R}^{|S|}$ with $\|\tau_S\|_\infty \le 1$, one has

$$\|W_{S^c}^{-1} \Sigma_{2,1}(S) \Sigma_{1,1}^{-1}(S) W_S \tau_S\|_\infty < 1.$$



The reparametrization $\beta \mapsto \gamma := W\beta$ leads to the following lemma, which is
the weighted variant of the first part of Lemma 6.2 in van de Geer and Bühlmann
[2009]. Here, we actually take $f^0$ as target, instead of its $\ell_0$-sparse
approximation $f_{S_0}$. Recall

$$S_{\rm true} := \{j : \beta_{j,{\rm true}} \ne 0\}.$$

Lemma 6.2 Suppose the weighted irrepresentable condition is met for $S_{\rm true}$. Then
$S_{\rm weight} \subset S_{\rm true}$.

We now consider conditions for the weighted irrepresentable condition to hold. 
Lemma 6.3 Suppose that 

Wwsh < A^i„(5X^. (13) 
Then the weighted irrepresentable condition holds for $S$.

The next example shows that the result of Lemma 6.3 cannot be improved
(essentially, up to the strict inequality) without assuming further conditions.

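Example 6.1 Let $S_{\rm true} = \{1, \ldots, s\}$, with cardinality $s := |S_{\rm true}|$, be the active
set, and write

$$\Sigma := \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}.$$

We now will take a special choice for $\Sigma$, which is perhaps not very representative
when $\Sigma$ is an empirical Gram matrix $\hat\Sigma$, but it is legitimate for a worst case
analysis (as we study here). We suppose that $\Sigma_{1,1} := I$ is the $(s \times s)$-identity
matrix, and

$$\Sigma_{2,1} := \rho \, (c_2 c_1^T),$$

with $0 \le \rho < 1$, and with $c_1$ an $s$-vector and $c_2$ a $(p - s)$-vector, satisfying
$\|c_1\|_2 = \|c_2\|_2 = 1$. Moreover, we suppose $\Sigma_{2,2}$ is the $((p - s) \times (p - s))$-identity
matrix. Then $\Lambda_{\min}(S_{\rm true}) = 1$ and the smallest eigenvalue of $\Sigma$ is
$1 - \rho$. Its largest eigenvalue is $1 + \rho$. Take $c_1 = w_{S_{\rm true}} / \|w_{S_{\rm true}}\|_2$, and $c_2 =
(0, \ldots, 1, 0, \ldots)^T$, where the 1 is placed at $\arg\min_{j \in S^c_{\rm true}} w_j$. Then

$$\sup_{\|\tau_{S_{\rm true}}\|_\infty \le 1} \|W_{S^c_{\rm true}}^{-1} \Sigma_{2,1} \Sigma_{1,1}^{-1} W_{S_{\rm true}} \tau_{S_{\rm true}}\|_\infty = \rho \, \|w_{S_{\rm true}}\|_2 / w^{\min}_{S^c_{\rm true}}.$$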
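The claims of Example 6.1 are easy to confirm numerically. The following Python sketch (NumPy assumed; the dimensions, the value of $\rho$ and the random weights are arbitrary illustrative choices) builds the matrix $\Sigma$ of the example, checks its extreme eigenvalues, and evaluates the supremum in the weighted irrepresentable condition. It uses the standard fact that the supremum of $\|A\tau\|_\infty$ over $\|\tau\|_\infty \le 1$ equals the largest row-wise $\ell_1$-norm of $A$.

```python
import numpy as np

s, p, rho = 3, 7, 0.5                       # illustrative sizes and correlation
rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=p)           # hypothetical positive weights

S = np.arange(s)                            # active set S_true = {1, ..., s}
Sc = np.arange(s, p)                        # its complement

c1 = w[S] / np.linalg.norm(w[S])            # c_1 = w_S / ||w_S||_2
c2 = np.zeros(p - s)
c2[np.argmin(w[Sc])] = 1.0                  # unit vector at the smallest weight outside S

Sigma = np.eye(p)
Sigma[np.ix_(Sc, S)] = rho * np.outer(c2, c1)   # Sigma_{2,1}
Sigma[np.ix_(S, Sc)] = rho * np.outer(c1, c2)   # Sigma_{1,2} = Sigma_{2,1}^T

eigenvalues = np.linalg.eigvalsh(Sigma)

# Matrix appearing in the weighted irrepresentable condition.
A = (np.diag(1.0 / w[Sc]) @ Sigma[np.ix_(Sc, S)]
     @ np.linalg.inv(Sigma[np.ix_(S, S)]) @ np.diag(w[S]))
sup_value = np.abs(A).sum(axis=1).max()     # sup over ||tau||_inf <= 1 of ||A tau||_inf
closed_form = rho * np.linalg.norm(w[S]) / w[Sc].min()
```

With these choices `eigenvalues.min()` equals $1 - \rho$, `eigenvalues.max()` equals $1 + \rho$, and `sup_value` coincides with `closed_form`, up to floating point error.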

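As a special case, suppose $c_1 = (1, 1, \ldots, 1)^T / \sqrt{s}$, and $\rho = 1/2$. The adaptive
Lasso generally has $O(1 / w^{\min}_{S^c_{\rm true}}) = \lambda_{\rm init}$. The irrepresentable condition then
needs

$$\sqrt{\sum_{j \in S_{\rm true}} w_j^2} = \|w_{S_{\rm true}}\|_2 = O(1 / \lambda_{\rm init})$$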

(which holds for example when Ainit \/s = 0{minj^Struc \Pj,truc\)-) This condition 
also shows up in Corollary VJ.IA i.e., the result there is tight. 



7 Adding noise 

After introducing the notation for the noisy case (Subsection 7.1), we will give
the extension of the results for the weighted Lasso to the noisy case (see
Theorem 7.1). Once this is done, results for the initial Lasso, its thresholded version,
and for the adaptive Lasso, follow in the same way as in Subsections 6.2, 6.3
and 6.4. The new point is to take care that the tuning parameters are chosen
in such a way that the noisy part due to variables in $S^c$ is overruled by the
penalty term. In our situation, this can be done by taking $\lambda_{\rm init}$, as well as
$\lambda_{\rm adap} \ge \lambda_{\rm init}$, sufficiently large.



Of separate interest is a direct comparison of the noisy initial Lasso with the noisy
$\ell_0$-penalized estimator. Replacing $f^0$ by $Y$ in Corollary 8.1 (and dropping the requirement
$S \subset S_{\rm true}$) gives

$$\|Y - f_{\rm init}\|^2 \le 2 \min_S \left\{ \|Y - f_S\|^2 + \frac{3 \lambda^2_{\rm init} |S|}{\phi^2(2, S)} \right\}.$$

We provide the result for the noisy weighted Lasso in Subsection 7.2. Theorems
3.1, 3.2 and 3.3 follow from this and from some further results for the noisy case
(their proofs are in Subsection 8.3). In Section 7.3, we look at more restrictive
sparse eigenvalue conditions in the spirit of Zhang and Huang [2008].



7.1 Notation for the noisy case 

Consider an $n$-dimensional vector of observations

$$Y = f^0 + \epsilon,$$

where $f^0 := (f^0(X_1), \ldots, f^0(X_n))^T$, with $X_1, \ldots, X_n$ co-variables in some space
$\mathcal{X}$. Let $\{\psi_j\}_{j=1}^p$ be a given dictionary.

The regression $f^0$, the dictionary $\{\psi_j\}$, and $f_\beta := \sum_{j=1}^p \psi_j \beta_j$ are now considered
as vectors in $\mathbb{R}^n$. The norm we use is the normalized Euclidean norm

$$\|f\| := \|f\|_n := \|f\|_2 / \sqrt{n}, \quad f \in \mathbb{R}^n,$$

induced by the inner product

$$(f, \tilde{f})_n := \frac{1}{n} \sum_{i=1}^n f_i \tilde{f}_i, \quad f, \tilde{f} \in \mathbb{R}^n.$$

In other words, the probability measure $Q$ is now $Q := Q_n = \sum_{i=1}^n \delta_{X_i} / n$, the
empirical measure of the co-variables $X_1, \ldots, X_n$. With some abuse of notation,
we also write

$$\|Y - f\|_n^2 := \|Y - f\|_2^2 / n,$$

and

$$(\epsilon, f)_n := \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i).$$

The design matrix $X$ is

$$X = (\psi_1, \ldots, \psi_p).$$



We write the eigenvalues involved as before, e.g., $\Lambda^2_{\max}$ is the largest eigenvalue
of the empirical Gram matrix $\hat\Sigma := X^T X / n$, and $\phi(L, S, N)$ is the $(L, S, N)$-restricted
eigenvalue of $\hat\Sigma$. The projections in $L_2(Q_n)$ are also written as before,
i.e.

$$f_S := X b^S := \arg\min_{f = X \beta_S} \|f - f^0\|_n.$$

The $\ell_0$-sparse projection $f_{S_0} = \sum_{j \in S_0} \psi_j b^0_j$ is now defined with a larger constant (7
instead of 3) in front of the penalty term, and a larger constant ($L = 6$ instead
of $L = 2$) in the restrictions of the restricted eigenvalue condition:

$$S_0 := \arg\min_{S \subset S_{\rm true}} \left\{ \|f_S - f^0\|_n^2 + \frac{7 \lambda^2_{\rm init} |S|}{\phi^2(6, S)} \right\}$$

(compare with formula (9)).
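The weighted Lasso is

$$\hat\beta_{\rm weight} := \arg\min_\beta \left\{ \|Y - f_\beta\|_n^2 + \lambda_{\rm init} \lambda_{\rm weight} \sum_{j=1}^p w_j |\beta_j| \right\}. \qquad (14)$$

Let

$$\hat{f}_{\rm weight} := f_{\hat\beta_{\rm weight}}, \qquad \hat{S}_{\rm weight} := \{j : \hat\beta_{j,{\rm weight}} \ne 0\}.$$

The initial and adaptive Lasso are defined as in Section 1. We write $\hat{f}_{\rm init} := f_{\hat\beta_{\rm init}}$
and $\hat{f}_{\rm adap} := f_{\hat\beta_{\rm adap}}$, with active sets $\hat{S}_{\rm init} := \{j : \hat\beta_{j,{\rm init}} \ne 0\}$ and
$\hat{S}_{\rm adap} := \{j : \hat\beta_{j,{\rm adap}} \ne 0\}$, respectively. Let

$$\hat\delta_{\rm init} := \|f_{\hat\beta_{\rm init}} - f^0\|_n$$

be the prediction error of the initial Lasso, and, for $q \ge 1$, let

$$\hat\delta_q := \|\hat\beta_{\rm init} - b^0\|_q$$

be its $\ell_q$-error. Denote the prediction error of the adaptive Lasso by

$$\hat\delta_{\rm adap} := \|f_{\hat\beta_{\rm adap}} - f^0\|_n.$$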

The least squares estimator using only variables in $S$ is also written with a
"hat":

$$\hat{f}_S = f_{\hat{b}^S} := \arg\min_{f = f_{\beta_S}} \|Y - f_{\beta_S}\|_n.$$

A threshold level will be denoted by $\delta$, instead of $\lambda_{\rm thres}$ as we do in Section
1. The reason is again that we need to explicitly express dependence on the
threshold level. With $\lambda_{\rm thres}$ the notation will be too complicated. We define,
for any threshold $\delta > 0$,

$$\hat{S}^\delta_{\rm init} := \{j : |\hat\beta_{j,{\rm init}}| > \delta\}.$$

The refitted version after thresholding, based on the data $Y$, is $\hat{f}_{\hat{S}^\delta_{\rm init}}$.

To handle the (random) noise, we define the set

$$\mathcal{T} := \left\{ \max_{1 \le j \le p} 4 |(\epsilon, \psi_j)_n| \le \lambda_{\rm init} \right\}.$$

This is the set where the (empirical) correlations between noise and design are
"small".

Here $\lambda_{\rm init}$ is chosen in such a way that

$$P(\mathcal{T}) \ge 1 - \alpha,$$

where $(1 - \alpha)$ is the confidence we want to achieve.
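To make the role of the set $\mathcal{T}$ concrete, the Python sketch below estimates $P(\mathcal{T})$ by simulation. It rests on assumptions that are not made in the text above: the noise is taken to be i.i.d. Gaussian with known standard deviation `sigma`, and $\lambda_{\rm init}$ is set by a union bound over the $p$ normalized columns, which gives $\lambda_{\rm init} = 4 \sigma \sqrt{2 \log(2p/\alpha) / n}$. The function name `lambda_init_gaussian` and all numerical values are hypothetical.

```python
import numpy as np

def lambda_init_gaussian(n, p, sigma, alpha):
    """Union-bound choice of lambda_init for i.i.d. N(0, sigma^2) noise (assumption)."""
    return 4.0 * sigma * np.sqrt(2.0 * np.log(2.0 * p / alpha) / n)

rng = np.random.default_rng(2)
n, p, sigma, alpha = 100, 50, 1.0, 0.1

# Fixed design with normalized columns, so that ||psi_j||_n = 1 for all j.
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))

lam_init = lambda_init_gaussian(n, p, sigma, alpha)

# Monte Carlo estimate of P(T), with T = {max_j 4 |(eps, psi_j)_n| <= lambda_init}.
reps = 2000
eps = sigma * rng.standard_normal((reps, n))
corr = np.abs(eps @ X) / n                  # |(eps, psi_j)_n| per replication and column
on_T = 4.0 * corr.max(axis=1) <= lam_init
prob_T = on_T.mean()
```

Because the union bound is conservative, the estimate `prob_T` is typically well above the nominal level $1 - \alpha$ in this setting.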
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7.2 The noisy weighted Lasso 

Theorem 7.1 Suppose we are on T ■ Let S be a set with cardinality s = \S\, 
which satisfies for some positive L and M 

Aweight«i" AM) > 1, 

and 

> M/L, Wwsh/Vs < M. 

Then for all /3, 

II /weight - I lln < ^llJfe - I lln H 



and 



•\/s||(^weight)5 -/^slb + ||(/3weight)5'=l|l/-^^ 
^ 5||//3g — f'^ll^ 7AinitAwcight-Ms 
~ AinitAweight^ 4>^{^L,S) 



and 



1 1 ^weight - PsW 



2 



^ 10L||/fe-fO||^ ^ 14LA2„.,A^,ight^(^ + ^o) 



MAinitAwcight\/S0 0^(6L, S,S + Sq) Ainit Aweight\/^ 

Moreover, under the condition Aweightf^™™ > 1, 

|(5'weight n S"^)\S'oP 

II f . , , _ f0||2 ||(1/w)a \o \\l 
II J weight ^ lira ' 'Owoight\oo 



^ 16Ajng^x(('S'weight n S'^)\So) 



\2 \2 
^weight '^init 



When |(5'weight H 5'^)\5o| > sq, this implies 

11 f . , , _ f0||2 ||(1/'U;)5 \ c llo 
llJweight I lln ' '^'wcightXAo 



K^weight n 5'')\S'o| < 32A sg(so 



2 

sparseV'^u; ,2 „ \2 
^^weip-ht^O \ 



weight U init 

7.3 Another look at the number of false positives 

Here, we discuss a refinement, assuming a condition corresponding to the one
used in Zhang and Huang [2008].



Condition D It holds for some > sq; that 

^sparse('5*)'^l 

^H2,So)s, 
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Lemma 7.1 Suppose we are on T . Then under Condition D, 



|'S'init\5'o| 



^ ^sparse \'^* 



<A2(6,5o) 
Moreover, under Condition B, 



Osuff(l) 



-1 



0{so) 



A, 



|'S'adap\'S'o| — -^sparse (s*) 

Asparse(so)</'^(6,5'o) 



sparse 



So) 



(6,5o,2so)04(6,5o,2so) 



1/2 



Ainit 
-^adap 



Oiso) 



+ 



_0min(6, ^0, 2so)</'^(6, 5*0, 2so) 

Under Condition BB, this becomes 



-^adap 



|'S'adap\'S'o| 



■^sparse ('5* ) 

0(6, So) 



<Pl.j6,So,2so)<P\6,So) 



/.2(6,5o,2so 



1/2 



O(so) (15) 



+ 



cPl,^{6,So,2so)(^\2,So) 
<P^i6,So,2so) 



D{s^,so)0{so). 



Under Condition D, the first term in the right hand side of (15) is generally
the leading term. We thus see the adaptive Lasso replaces the potentially very
large constant

Osuff(l) J 

in the bound for the number of false positives of the initial Lasso by 



62 

mm 



i„(6,5o,2so)</'2(6,5o) 



.1/2 



(/.4(6,5o,2so) 

a constant which is close to 1 if the $\phi$'s do not differ too much.

Admittedly, Condition D is difficult to interpret. On the one hand, it wants $s_*$
to be large, but on the other hand, a large $s_*$ also can render $\Lambda_{\rm sparse}(s_*)$ large.
We refer to Zhang and Huang [2008] for examples where Condition D is met.



8 Proofs 

We present three subsections, containing respectively the proofs for Section 6,
Section 7 and finally Section 3.



8.1 Proofs for Section 6: the noiseless case

8.1.1 Proofs for Subsection 6.1: the noiseless weighted Lasso

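Proof of Theorem 6.1. Take $S$ satisfying

$$w^{\min}_{S^c} \ge M/L, \qquad \|w_S\|_2 / \sqrt{s} \le M.$$

We have

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} \sum_{j=1}^p w_j |\beta_{j,{\rm weight}}| \le \|f_{\beta_S} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} \sum_{j \in S} w_j |\beta_j|,$$

and hence

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{S^c}\|_1
\le \|f_{\beta_S} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} \sum_{j \in S} w_j |\beta_{j,{\rm weight}} - \beta_j|
\le \|f_{\beta_S} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{s} \, \|(\beta_{\rm weight})_S - \beta_S\|_2.$$

Let $\mathcal{N} \supset S$, $|\mathcal{N}| = N$. Then

$$\|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1 \le \|(\beta_{\rm weight})_{S^c}\|_1,$$

and

$$\|(\beta_{\rm weight})_S - \beta_S\|_2 \le \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2, \qquad \sqrt{s} \le \sqrt{N}.$$

Therefore,

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1
\le \|f_{\beta_S} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2.$$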

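Case i). If

$$\|f_{\beta_S} - f^0\|^2 \le \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2,$$

we get

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1
\le 2 \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2. \qquad (16)$$

It follows that

$$\|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1 \le 2L \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2.$$

But then, by the definition of restricted eigenvalue, and invoking the triangle
inequality,

$$\|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2 \le \|f_{\rm weight} - f_{\beta_S}\| / \phi(2L, \mathcal{N})
\le \|f_{\rm weight} - f^0\| / \phi(2L, \mathcal{N}) + \|f_{\beta_S} - f^0\| / \phi(2L, \mathcal{N}).$$

This gives

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1
\le 2 \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|f_{\rm weight} - f^0\| / \phi(2L, \mathcal{N})
+ 2 \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|f_{\beta_S} - f^0\| / \phi(2L, \mathcal{N})$$

$$\le \frac{1}{2} \|f_{\rm weight} - f^0\|^2 + \|f_{\beta_S} - f^0\|^2 + \frac{3 \lambda^2_{\rm init} \lambda^2_{\rm weight} M^2 N}{\phi^2(2L, \mathcal{N})}.$$

Hence,

$$\|f_{\rm weight} - f^0\|^2 + 2 \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1 \le 2 \|f_{\beta_S} - f^0\|^2 + \frac{6 \lambda^2_{\rm init} \lambda^2_{\rm weight} M^2 N}{\phi^2(2L, \mathcal{N})}.$$

Case ii). If

$$\|f_{\beta_S} - f^0\|^2 > \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2,$$

we get

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1 \le 2 \|f_{\beta_S} - f^0\|^2.$$

The first result of the theorem now follows from taking $\mathcal{N} = S$.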



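For the second result, we add in Case i), $\lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2$ to
the left and right hand side of (16):

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2
+ \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1
\le 3 \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2.$$

The same arguments now give

$$\|f_{\rm weight} - f^0\|^2 + \lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2 + \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1
\le \|f_{\rm weight} - f^0\|^2 + 3 \|f_{\beta_S} - f^0\|^2 + \frac{3 \lambda^2_{\rm init} \lambda^2_{\rm weight} M^2 N}{\phi^2(2L, \mathcal{N})}.$$

In Case ii), we have

$$\lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1 \le 2 \|f_{\beta_S} - f^0\|^2,$$

and also

$$\lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2 < \|f_{\beta_S} - f^0\|^2.$$

So then

$$\lambda_{\rm init} \lambda_{\rm weight} M \sqrt{N} \, \|(\beta_{\rm weight})_{\mathcal{N}} - \beta_S\|_2 + \lambda_{\rm init} \lambda_{\rm weight} w^{\min}_{S^c} \|(\beta_{\rm weight})_{\mathcal{N}^c}\|_1
\le 3 \|f_{\beta_S} - f^0\|^2.$$

Taking $\mathcal{N} = S$ gives the second result.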

For the third result, we let J\f be the set S, complemented with the sq largest - 
in absolute value - coefficients of (/3weight)5'=- Then (f){2L,M) < cj){2,S,s + sq). 
Moreover, N > sq. Thus, from the second result, we get 

AinitAweight^-\/so||(/3weight)Ar — f^slh + Ainit Aweight || (/^weight )Ar'= || 1 
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3ALitAL.ht(so + s)M2 



<3||/^^ _f0||2+ w-ght 



<PH2L,S,s + so) • 
Moreover, as is sh own in Lemma 2. 2 in van de Geer and Biihlm ann' fioool] (with 



one 



ginal reference Candes and Taol |2005l |. arid" Candes and Tad |,2007. ] ) 



eight)Arc||2 < ||(/3weight)s':||l/\/so 

^ 3LII f«„ - f' 



3L||/^, - fOf + SL^^^is + s^)M^/4>\L, S, s + sq) 



Ainit ^weight ^ ^/sq 

So then 

||/3weight - /35II2 < ||(/3weight)Ar - /3s||2 + || (/3weight)ArH|2 



< 



6L||/fe - fOf + 6LAf„itALght(^ + ^o)MV0^(2L, 5, . + gp) 

□ 



We now turn to the proof of Lemma 6.1. An important characterization of the
solution $\beta_{\rm weight}$ can be derived from the Karush-Kuhn-Tucker (KKT) conditions
(see Bertsimas and Tsitsiklis [1997]).

Weighted KKT-conditions We have

$$2 \Sigma (\beta_{\rm weight} - \beta_{\rm true}) = -\lambda_{\rm weight} \lambda_{\rm init} W \tau_{\rm weight}.$$

Here, $\|\tau_{\rm weight}\|_\infty \le 1$, and moreover

$$\tau_{j,{\rm weight}} 1\{\beta_{j,{\rm weight}} \ne 0\} = {\rm sign}(\beta_{j,{\rm weight}}), \quad j = 1, \ldots, p.$$
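The weighted KKT conditions can be verified numerically for a small noiseless problem. The sketch below is an illustration only: it assumes NumPy, a hypothetical random design, and a plain coordinate descent solver written for this example (the name `weighted_lasso` and all numerical values are not from the paper). After solving, it recovers $\tau_{\rm weight}$ from the stationarity equation and checks both of its stated properties.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso(Sigma, beta_true, lam, w, n_sweeps=5000, tol=1e-13):
    """Coordinate descent for (beta - beta_true)' Sigma (beta - beta_true)
    + lam * sum_j w_j |beta_j|, i.e. the noiseless weighted Lasso."""
    p = len(beta_true)
    b = Sigma @ beta_true
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            c_j = Sigma[j] @ beta - Sigma[j, j] * beta[j] - b[j]
            beta[j] = soft_threshold(-c_j, lam * w[j] / 2.0) / Sigma[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Hypothetical toy problem.
rng = np.random.default_rng(3)
n, p = 40, 8
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))          # normalized columns
Sigma = X.T @ X / n
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
w = rng.uniform(0.5, 2.0, size=p)
lam = 0.3                                    # lambda_init * lambda_weight

beta_weight = weighted_lasso(Sigma, beta_true, lam, w)

# Solve 2 Sigma (beta_weight - beta_true) = -lam * W * tau for tau.
gradient = 2.0 * Sigma @ (beta_weight - beta_true)
tau = -gradient / (lam * w)
active = beta_weight != 0
```

At the computed solution, `np.abs(tau).max()` does not exceed one (up to solver tolerance), and `tau[active]` matches `np.sign(beta_weight[active])`.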

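Proof of Lemma 6.1. By the weighted KKT conditions, for all $j$,

$$2 (\psi_j, f_{\rm weight} - f^0) = -\lambda_{\rm init} \lambda_{\rm weight} w_j \tau_{j,{\rm weight}}.$$

Hence,

$$\sum_{j \in S_{\rm weight} \setminus S_0} |2 (\psi_j, f_{\rm weight} - f^0)|^2 \ge \lambda^2_{\rm init} \lambda^2_{\rm weight} \|w_{S_{\rm weight} \setminus S_0}\|_2^2
\ge \lambda^2_{\rm init} \lambda^2_{\rm weight} |S_{\rm weight} \setminus S_0|^2 / \|(1/w)_{S_{\rm weight} \setminus S_0}\|_2^2.$$

On the other hand,

$$\sum_{j \in S_{\rm weight} \setminus S_0} |(\psi_j, f_{\rm weight} - f^0)|^2 \le \Lambda^2_{\max}(S_{\rm weight} \setminus S_0) \, \|f_{\rm weight} - f^0\|^2.$$

Thus, we arrive at inequality (11):

$$|S_{\rm weight} \setminus S_0|^2 \le 4 \Lambda^2_{\max}(S_{\rm weight} \setminus S_0) \, \frac{\|f_{\rm weight} - f^0\|^2 \, \|(1/w)_{S_{\rm weight} \setminus S_0}\|_2^2}{\lambda^2_{\rm weight} \lambda^2_{\rm init}}.$$

Clearly,

$$\Lambda^2_{\max}(S_{\rm weight} \setminus S_0) \le \Lambda^2_{\max} \wedge \left( \frac{|S_{\rm weight} \setminus S_0|}{s_0} + 1 \right) \Lambda^2_{\rm sparse}(s_0).$$

□

8.1.2 Proofs for Subsection 6.2: the noiseless initial Lasso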



We first present the corollaries of Theorem 6.1 and Lemma 6.1 when we apply
them to the case where all the weights are equal to one.

Corollary 8.1 For the initial Lasso, $w_j = 1$ for all $j$, so we can apply Corollary
6.1 with $\delta = 1$ and $S^\delta_{\rm weight} = S_0$. Let



'^oracle • W^So W ' 

We have 



r2 ^ ollf f0||2 , SAj^JSol _ 2 

Oinit ^ ^l|l5o ~ I II "T >2(2 S ~ 



>0 

The estimation error can be bounded as follows: 



6l <3\\iSo - f°lP/Ainit + ^2'(-2^gj| - ^'^oracle/'^mit, 

and 



5o < 



r{2,So) 

62(2,5o,2so) 



^'^oracle 
Ainit\/S0 



Moreover, application of Lemma \6. 1\ bounds the number of false positives: 

2 ^'^ ■ 

|5'init\'S'o| < 4Aj^a^x('S'init\5'o)Ty^- 

'^init 

Proof of Theorem 6.2.

This is now a direct consequence of Corollary 8.1. □



8.1.3 Proofs for Subsection 16. 3t the noiseless thresholded Lasso 



We first provide some explicit bounds. 
Lemma 8.1 We have 

||(Anit)s^ <2(5i+5so, 
init 

and 

Anitis!' . - &°ll2 < 2^2 + (5^/io, 



and 



I^^L. - ll/(ftnit),. -f°l 

init 



<l|f5o-fl + 



+ 1 



Asparse(so)(2(52 + ^V^) 



and, for 6 > 62/ \/s, 



0| 



V^sparse 
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Proof of Lemma 18.11 To obtain the first result, we use 

3init)5.«^.^ - 6° 111 = 11(6° - Pinit) sf^Jh + ll(^°)5o\5f„iJll- 



Now, 
Moreover 



IK^'^-AniOs* lll<'^l 

init 

< ll(^° - Anit)5o\5f„iJll + '^^0 <h+ Ssq. 
'mit)ss . - ^°l|l < 2(5i +6so. 



Hence 



The £2-6rror of the second result follows by the same arguments. 

The first inequality of the third result follows from the definition of f^a as 
projection, and the second follows from the triangle inequality, where we invoke 
that 

I'S'initV'S'ol < 



SO that 



I'S'initI < — + So, 



52 



and thus 

maxV^init/ — 

The final result follows from 

^<5 



52 



So 



+ 1 



■^sparse(so)- 



Amiii(5'init U So) > (j) 

sparse ('S'o, |5'init\'S'o| + So) > </) sparse (5o,2so). 



Proof of Theorem 16.31 



□ 



Inserting the bound 82 = 0(Ainit-yio/(/>^(2, 5*0, 2so)) (see Theorem I6.2p . and 

llfeo - 



is, -f°|| = 0(AinitV^/(/''(2,So)), we get for Ainit/</>'(2, ^o) = 0{5), S > 



•^init 



f 



0||2 



^sparse ( So) 



+ 



|6^init-60| 



-^sparse (so) ^ 

fsparse (5o,2so) 
6 



O(AfnitSo), 



+ 



/)2(2,5o,2so) A: 



and 



|5'init\'S'o| 



X2 

^init 



O(so). 



□

8.1.4 Proofs for Subsection 6.4: the noiseless adaptive Lasso

We use that when $\delta \ge \delta_2 / \sqrt{s_0}$, then $|S^\delta_{\rm init} \setminus S_0| \le s_0$. Application of Corollary 6.1
then gives

Corollary 8.2 We have, for all 6 > 52/^/s(), and all 13 

a2 <-0\\f f0||2 I -^^^init^adap^O 

'^adap < II + S2^2^.j2,So,2soy 



and 



and 



o o II ^ ^'^^^•^^^it ^ , SAinitAadapgQ 

3adap PsLM - \. . \ . + Z^:2 



-^initAadap ^<Amm(2) 'S'o, 2so) ' 



o „ ^ ^'^"^it ~ ^ " , 12AinitAadapVi^ 
3adap — Ps^ ■ II2 ^ 1=1 T r 



/SoAinitAadap (5(?!)4in(2, 'S'o, 3so) 

and, from Lemma [?!7l 

ll/(A„it),. - f°ll' < 2||f5o - f°f + mK%.,,M5^s,. 

init 

Furthermore, from Lemma \6.1\ , 

|'S'adap\'S'o| < ^max('S'adap\'S'o)T2 Tp ■ 

adap init 

I'S'adapV'S'ol > So, we have 

I'S'adapV'S'ol < 8Asparse(so) A 2A 



2 / N_^adap__^ „ . ^^adap ^2 
sparse 1*0 J ,2 \2 ^^'-max > ^ 

^adap'^0 \rat '^adap ^init 



We note that in the above corollary, the use of the $\ell_2$-error $\delta_2$ is rather crucial for the variable selection result: with the weights $w_j = 1/|\beta_{j,{\rm init}}|$, we have

$$\|(1/w)_{S\setminus S_0}\|_2 = \|(\beta_{\rm init})_{S\setminus S_0}\|_2 \le \delta_2.$$

With alternative weights $w_j = 1/\sqrt{|\beta_{j,{\rm init}}|}$, the theory can also be developed using only the $\ell_1$-error $\delta_1$.
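
The role of the two weight choices can be made concrete with a small Python sketch. Everything in it is hypothetical: the helper names `adaptive_weights` and `inverse_weight_norm` and the toy vectors are assumptions for illustration. With $w_j = 1/|\beta_{j,{\rm init}}|$ the norm $\|(1/w)_{S\setminus S_0}\|_2$ is bounded by the $\ell_2$-error, whereas with $w_j = 1/\sqrt{|\beta_{j,{\rm init}}|}$ its square is bounded by the $\ell_1$-error.

```python
import math

def adaptive_weights(beta_init, power=1.0):
    """Weights w_j = 1 / |beta_init_j|**power (infinite when the initial estimate is zero)."""
    weights = []
    for b in beta_init:
        magnitude = abs(b) ** power
        weights.append(1.0 / magnitude if magnitude > 0.0 else math.inf)
    return weights

def inverse_weight_norm(weights, index_set):
    """Euclidean norm of the vector (1 / w_j), j in index_set; 1/inf evaluates to 0."""
    return math.sqrt(sum((1.0 / weights[j]) ** 2 for j in index_set))

# Hypothetical toy numbers: S_0 = {0, 1}, so indices 2, 3, 4 lie outside S_0.
beta_init = [1.5, -0.8, 0.05, 0.0, -0.02]
b0 = [1.4, -1.0, 0.0, 0.0, 0.0]
outside_s0 = [2, 3, 4]

delta_1 = sum(abs(a - b) for a, b in zip(beta_init, b0))
delta_2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_init, b0)))

w = adaptive_weights(beta_init)            # w_j = 1/|beta_j,init|
w_sqrt = adaptive_weights(beta_init, 0.5)  # w_j = 1/sqrt(|beta_j,init|)
norm_w = inverse_weight_norm(w, outside_s0)
norm_w_sqrt = inverse_weight_norm(w_sqrt, outside_s0)

if __name__ == "__main__":
    print("||(1/w)||_2 <= delta_2:", norm_w <= delta_2)
    print("||(1/w_sqrt)||_2^2 <= delta_1:", norm_w_sqrt ** 2 <= delta_1)
```

An infinite weight simply means the corresponding variable is excluded from the second stage.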

A further observation is that the above corollary is an obstructed oracle inequality, where the oracle is restricted to choose the index set as a thresholded set of the initial Lasso. Concentrating on prediction error, it leads to defining the "oracle" threshold as

5o:=arg min (nf^. _ f0||2 + ^^inAp^o 1 

This oracle has active set $S^{\delta_0}_{\rm init}$, with size $|S^{\delta_0}_{\rm init}| = O(s_0)$. Our following considerations, however, will not be based on this optimal threshold, but rather on thresholds that allow a comparison with the results for the thresholded initial Lasso. This means that we might lose here some further favorable properties of the adaptive Lasso.

Proof of Theorem 6.4 Corollary 8.2 combined with Lemma 8.1 gives that for all $\delta \ge \delta_2/\sqrt{s_0}$,



-^adap 



<4||f5o-f°f + 72A^p,,,,(.o)<^^so + 



2 

sparse V 



l^\nit\dap^O 

<52^^.j2,So,2so)- 



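Using moreover that $\|\beta_{\rm adap} - b^0\|_q \le \|\beta_{\rm adap} - (\beta_{\rm init})_{S^\delta_{\rm init}}\|_q + \|(\beta_{\rm init})_{S^\delta_{\rm init}} - b^0\|_q$ and the bound of Lemma 8.1, we get for $\delta \ge \delta_2/\sqrt{s_0}$,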



3adap - &°||l < 3(^S0 + 



65||f5o-f°f , 108A2p,rse(so)<53so 6AinitAadapS0 



-^init ■^adap 



+ 



and 



■^init -^adap 

12y5„-f0f 

■SO'^initAadap 



+ 



"mm 



(2,S'o,2so) 



||/3adap-?>°||2 <35\/i^ + 



^ 216Aspar.se('So)'^^\/^ _^ 9Ainit Aadap\/S0 



-^init-^adap 

Finally, again for $\delta \ge \delta_2/\sqrt{s_0}$,

|5'adap\5'o| < 



84>l^{2,So,3soy 



8AsVsc(so)5i f 4||f5o - f°f ^ „„,2 , u2 ^ 12A2^i,A2^,p 

- + '^^sparsel'SojO + 



\2 \2 

init adap 



So 



By Corollary E: 



Taking 



O 



520^.^2, 5o,2so) 



2(2,5o,2so) 

■^init '^adap 



(2, So, 2so)Asparse(so) 

the requirement that $\delta \ge \delta_2/\sqrt{s_0}$ is fulfilled if we take

lin (2,5o,2so)A sparse 



that is, if Condition b holds. We then obtain 



C'suff('^adap) 



adap 



A, 



sparse 



(so) 



r(2,5o,2so) 



O(AinitAadap'So); 



3adap — ^°l|l 



A, 



1/2 

sparse 



^3/2 
mm 



(2,5'o,2so) 



O(y^AinitAadap'So) 



(18) 
and



5adap — ft^lb 



I'S'adapX'S'o 



Al^arse(so)</'L^n(2> ^Q, 2so) 



0^i„(2,5o,3so) 



0{\/ Ainit AadapSo) ) 



■'^sparse ("*o) 



■^sparse('So) 

(2,S'o,2so) 



^init 
Aadap 



Oiso) 



□ 



8.1.5 Proofs for Subsection 6.5 on the weighted irrepresentable condition

Proof of Lemma 6.2 This is the weighted variant of the first part of Lemma 6.2 in [van de Geer and Bühlmann, 2009]. □

Proof of Lemma 6.3 We define, as in [van de Geer and Bühlmann, 2009], the adaptive restricted regression



'i?adaptive(S') := max 



1 (/fee, //3s) I 



/3e7^{lT5) ]]//3s]]2 

Here, $(f, \tilde f)$ denotes the inner product between $f$ and $\tilde f$ as elements of $L_2(Q)$. We will show that

1^^5112 



sup \\Ws}^2,i{S)T.^l{S)WsTsU < 
I|t"sI|oo<i ' V 



-1? 



\S\w'§}:^ 



adaptive 



{S). (19) 



It is moreover not difficult to see that $\vartheta_{\rm adaptive}(S) \le \sqrt{|S|}/\Lambda_{\min}(S)$, so then the proof of Lemma 6.3 is done.

To derive (19), we first note that

\\Ws}T.2,i{S)Y.^\{S)WsTs\\oo < \\^2,i{S)T.^\{S)WsTs\\oo/w'i^. 



Define 
Then 



Ps:=^i\{S)WsTs. 



\Ws}^2AS)T.^l{S)WsTs\\oo = sup \^l.Ws}Y.2,i{S)i:^l{S)WsTs\ 

ll7sHll<l 

= sup \0sT.2,l{S)f3s\= sup l(//3<;c,//3s)l 

l|M^S<=/3s=l|i<l l|M/s':/3sHli<l 

< sup ](//3sc,//3s)l 

||/3s<=||i<l/«,g'i" 



sup 



l(//3s='//3s)l 



li/3sHli<lksl|2||/3s||2/«'P" \\WS\\2\\PS\\2 
But

WUsf r^Ws^T'iiS)WsTs \\WsTs\ 



WS\\2\\PS\\2 . LT-,A/2^^, U^^^rs--2(q\ix/^^^ \\Ws\\2 



Jr^Wlrs\/TsWs^li{S)WsTs 
We conclude that 

\\Ws.'^2,i{S)^i,\{S)WsTs\\oo < sup 



mm 



"adaptive 



\\Ps<^\\l<\\^"s\\2\\l3s\\2/w'^h 

Iwsh 



□ 



8.2 Proofs for Section 7: the noisy case

Theorem 7.1 gives bounds for prediction error, estimation error and the number of false positives of the noisy weighted Lasso.

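Proof of Theorem 7.1 We can derive the prediction and estimation results in the same way as in Theorem 6.1, adding now the noise term:

$$\|\hat f_{\rm weight} - f^0\|_n^2 + \lambda_{\rm init}\lambda_{\rm weight}\sum_{j=1}^p w_j|\hat\beta_{j,{\rm weight}}|$$
$$\le 2(\epsilon, \hat f_{\rm weight} - f_{\beta_S})_n + \|f_{\beta_S} - f^0\|_n^2 + \lambda_{\rm init}\lambda_{\rm weight}\sum_{j\in S} w_j|\beta_j|$$
$$\le \lambda_{\rm init}\|\hat\beta_{\rm weight} - \beta_S\|_1/2 + \|f_{\beta_S} - f^0\|_n^2 + \lambda_{\rm init}\lambda_{\rm weight}\sum_{j\in S} w_j|\beta_j|,$$

and hence, using $\lambda_{\rm weight}w^{\min}_{S^c} \ge 1$,

$$\|\hat f_{\rm weight} - f^0\|_n^2 + \lambda_{\rm init}\lambda_{\rm weight}w^{\min}_{S^c}\|(\hat\beta_{\rm weight})_{S^c}\|_1/2$$
$$\le \|f_{\beta_S} - f^0\|_n^2 + \left(\lambda_{\rm init}/2 + \lambda_{\rm init}\lambda_{\rm weight}\|w_S\|_2/\sqrt{s}\right)\sqrt{s}\,\|\hat\beta_{\rm weight} - \beta_S\|_2.$$

As $\lambda_{\rm weight}\|w_S\|_2/\sqrt{s} \ge 1$ it gives

$$\|\hat f_{\rm weight} - f^0\|_n^2 + \lambda_{\rm init}\lambda_{\rm weight}w^{\min}_{S^c}\|(\hat\beta_{\rm weight})_{S^c}\|_1/2$$
$$\le \|f_{\beta_S} - f^0\|_n^2 + 3\lambda_{\rm init}\lambda_{\rm weight}\|w_S\|_2\|\hat\beta_{\rm weight} - \beta_S\|_2/2.$$

Now insert $w^{\min}_{S^c} \ge M/L$, $1 \le \lambda_{\rm weight}M$ and $\|w_S\|_2/\sqrt{s} \le M$:

$$\|\hat f_{\rm weight} - f^0\|_n^2 + \lambda_{\rm init}\lambda_{\rm weight}M\|(\hat\beta_{\rm weight})_{S^c}\|_1/(2L)$$
$$\le \|f_{\beta_S} - f^0\|_n^2 + 3\lambda_{\rm init}\lambda_{\rm weight}M\sqrt{s}\,\|\hat\beta_{\rm weight} - \beta_S\|_2/2.$$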

The rest of the proof for the prediction and estimation error can therefore be carried out in the same way as the proof of Theorem 6.1.

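As for variable selection, we use as in Lemma 6.1 the weighted KKT conditions: for all $j$,

$$2(\psi_j, \hat f_{\rm weight} - f^0)_n - 2(\epsilon, \psi_j)_n = -\lambda_{\rm init}\lambda_{\rm weight}w_j\hat\tau_{j,{\rm weight}},$$

where $\|\hat\tau_{\rm weight}\|_\infty \le 1$ and $\hat\tau_{j,{\rm weight}}\,{\rm l}\{\hat\beta_{j,{\rm weight}} \ne 0\} = {\rm sign}(\hat\beta_{j,{\rm weight}})$. Invoking $\lambda_{\rm weight}w^{\min}_{S^c} \ge 1$, we know that for all $j \in S^c$, $\lambda_{\rm weight}w_j \ge 1$. Moreover, $2|(\epsilon, \psi_j)_n| \le \lambda_{\rm init}/2$ by the definition of $\mathcal T$. Therefore,

$$\sum_{j\in \hat S_{\rm weight}\cap S^c\setminus S_0}|2(\psi_j, \hat f_{\rm weight} - f^0)_n|^2 \ge \lambda^2_{\rm init}\lambda^2_{\rm weight}\|w_{\hat S_{\rm weight}\cap S^c\setminus S_0}\|_2^2/4.$$

One can now proceed as in Lemma 6.1.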

□ 
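
The weighted KKT conditions can be illustrated in the simplest possible setting. The sketch below is not the situation of the paper: it assumes a one-dimensional problem $\min_b\,(z-b)^2 + \lambda w|b|$, where $\lambda$ stands in for the product $\lambda_{\rm init}\lambda_{\rm weight}$, and the function names are hypothetical. In this scalar case the minimizer is a weighted soft-threshold, and the stationarity condition reads $2(z - b) = \lambda w\tau$ with $|\tau| \le 1$ and $\tau = {\rm sign}(b)$ whenever $b \ne 0$.

```python
def weighted_soft_threshold(z, lam, w):
    """Minimizer over b of (z - b)**2 + lam * w * abs(b), assuming lam * w > 0."""
    cut = lam * w / 2.0
    if z > cut:
        return z - cut
    if z < -cut:
        return z + cut
    return 0.0

def kkt_tau(z, lam, w):
    """Return (b, tau) with 2*(z - b) = lam * w * tau at the minimizer b."""
    b = weighted_soft_threshold(z, lam, w)
    tau = 2.0 * (z - b) / (lam * w)
    return b, tau

if __name__ == "__main__":
    for z in (-1.0, -0.2, 0.0, 0.2, 1.0):
        b, tau = kkt_tau(z, lam=0.5, w=2.0)
        print(f"z={z:+.1f}: b={b:+.2f}, tau={tau:+.2f}")
```

A larger weight $w$ widens the interval of $z$ values mapped to zero, which is the mechanism by which the adaptive weights suppress false positives.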



8.2.1 Proof of Lemma 7.1 with the more involved conditions

To prove this lemma, we actually need some results from Section 3 and an intermediate result in their proof. One may skip the present proof at first reading and first consult the next subsection (Subsection 8.3).



The bound for the number of false positives of the initial Lasso follows from the inequality

I A \ c I ^ ^max('S'init\'S'o) ^ / ^ 
JinitX-^O S .o/fi e X <^(SOj- 

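This follows from Theorem 7.1 and from inserting the bound of Theorem 3.1 for $\delta_{\rm init}$. One can then proceed by applying the inequality

$$\Lambda^2_{\max}(\hat S_{\rm init}\setminus S_0) \le \left(\frac{|\hat S_{\rm init}\setminus S_0|}{s_0} + 1\right)\Lambda^2_{\rm sparse}(s_0). \qquad (20)$$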



The result for the adaptive Lasso can be derived from 

■^sparse ('^o) 



I C \ Q |2 ^ ^max ( 'S'adap \ 'S'o ) 



i(6,5o,2so) 



Ainit ^/ N 

O{so). 



A, 



adap 



This follows from (22) (which can be found at the end of the proof of Theorem 3.3), invoking Condition B, and applying the bound of Theorem 3.3 for $\delta_{\rm adap}$, and the bound of Theorem 3.1 for $\delta_2$. Insert again (20) to complete the proof.
□

8.3 Proofs for Section 3

8.3.1 Proof of the probability inequality of Lemma 3.1

This follows easily from the probability bound $P(|Z| > \sqrt{2t}) \le 2\exp[-t]$ for a standard normal random variable $Z$. □
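
The Gaussian tail bound is easy to verify numerically. The Python sketch below is an illustration, with hypothetical function names. It uses the identity $P(|Z| > x) = {\rm erfc}(x/\sqrt{2})$ for a standard normal $Z$ and compares the exact two-sided tail at $x = \sqrt{2t}$ with $2\exp[-t]$.

```python
import math

def two_sided_normal_tail(x):
    """P(|Z| > x) for a standard normal Z, via the complementary error function."""
    return math.erfc(x / math.sqrt(2.0))

def tail_bound_holds(t):
    """Check P(|Z| > sqrt(2t)) <= 2 * exp(-t) for a given t >= 0."""
    return two_sided_normal_tail(math.sqrt(2.0 * t)) <= 2.0 * math.exp(-t)

if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0, 5.0):
        exact = two_sided_normal_tail(math.sqrt(2.0 * t))
        print(f"t={t}: exact tail={exact:.3e}, bound={2.0 * math.exp(-t):.3e}")
```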



8.3.2 Proof of Theorem 3.1: the noisy initial Lasso

Theorem 3.1 is a simplified formulation of Corollary 8.3 below. This corollary follows from Theorem 7.1 by taking $L = 1$ and $S = S_0$.



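Corollary 8.3 Let

$$\delta^2_{\rm oracle} := \|f_{S_0} - f^0\|_n^2 + \frac{7\lambda^2_{\rm init}|S_0|}{\phi^2(6,S_0,2s_0)}.$$

Take $\lambda_{\rm init} \ge 2\lambda_{\rm noise}$. We have on $\mathcal T$,

$$\delta^2_{\rm init} \le 2\delta^2_{\rm oracle}.$$

Moreover, on $\mathcal T$,

$$\delta_1 \le 5\delta^2_{\rm oracle}/\lambda_{\rm init},$$

and

$$\delta_2 \le 10\delta^2_{\rm oracle}/(\lambda_{\rm init}\sqrt{s_0}).$$

Also, on $\mathcal T$,

$$|\hat S_{\rm init}\setminus S_0| \le 16\Lambda^2_{\max}(\hat S_{\rm init}\setminus S_0)\frac{\delta^2_{\rm init}}{\lambda^2_{\rm init}}.$$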



8.3.3 Proof of Theorem 3.2: the noisy thresholded Lasso

The least squares estimator $\hat f_{\hat S^\delta_{\rm init}}$ using only variables in $\hat S^\delta_{\rm init}$ (i.e., the projection of $Y = f^0 + \epsilon$ on the linear space spanned by $\{\psi_j\}_{j\in\hat S^\delta_{\rm init}}$) has similar prediction properties as $f_{\hat S^\delta_{\rm init}}$ (the projection of $f^0$ on the same linear space). This is because, as is shown in the next lemma, their difference is small.

Lemma 8.2 Let $\delta \ge \delta_2/\sqrt{s_0}$. Then on $\mathcal T$,



If _ f-^ ||2 < _ Mnit'- 
''sparse V 



'^ilit '^ilit'i" 



20sparse('S'o,2so) 

Proof of Lemma 8.2 This follows from

'^init '^init "^init "^init 

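and

$$2(\epsilon, \hat f_{\hat S^\delta_{\rm init}} - f_{\hat S^\delta_{\rm init}})_n \le \lambda_{\rm init}\|\hat b^{\hat S^\delta_{\rm init}} - b^{\hat S^\delta_{\rm init}}\|_1/2$$
$$\le \lambda_{\rm init}\sqrt{2s_0}\,\|\hat b^{\hat S^\delta_{\rm init}} - b^{\hat S^\delta_{\rm init}}\|_2/2 \le \lambda_{\rm init}\sqrt{2s_0}\,\|\hat f_{\hat S^\delta_{\rm init}} - f_{\hat S^\delta_{\rm init}}\|_n/(2\phi_{\rm sparse}(S_0,2s_0)).$$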



□ 



Proof of Theorem 3.2 The bound $\|(\hat\beta_{\rm init})_{\hat S^\delta_{\rm init}} - b^0\|_2 \le 2\delta_2 + \delta\sqrt{s_0}$ can be derived in the same way as in Lemma 8.1. The same is true for the bound

init 



< ||fso-f°||n + 



r x2 



Asparsc(so)(252 + S^/s^). 



Assumption A together with Lemma 8.2 complete the proof for the bounds for prediction and estimation error, with the $\ell_1$-bound being a simple consequence of the $\ell_2$-bound. Also, the variable selection result follows from

$$|\hat S^\delta_{\rm init}\setminus S_0| \le \frac{\delta_2^2}{\delta^2}$$

and Assumption A. □

8.3.4 Proof of Theorem 3.3: the noisy adaptive Lasso

We first apply Theorem 7.1 to the adaptive Lasso.

Corollary 8.4 Suppose we are on $\mathcal T$. Take $\lambda_{\rm adap} \ge \delta \ge \delta_2/\sqrt{s_0}$.

We have, for all $\delta \ge \delta_2/\sqrt{s_0}$, and all $\beta$

?2 <9||f f0||2 I ^^-^fnit-^adap^O 

-^adap - lln+^2^2^^j6,5o,2.o)' 



and 



and 



^'^"%it"^°"" , 14AinitAadap50 



5adap — Ps^ .111 — T — 1" 



Ainit Aadap 5(/)^in(6, 5o, 2so) ' 

, 42AinitAadapVi^ 



'adap - %ji2 < — , — + 



'so Ainit Aadap ^ <Amin (6 , 5*0 , 3so ) 



Moreover 

K^adap n {SW)\So\ <S, + 32A,parse(.0)T|^Tf- A 4A, '^^^^^ 



' \ 2 \ 2 ' - --max , 

^adap'^O \rdt ^adap Ainit 

Proof of Theorem 3.3

By the same arguments as used in Lemma 8.1, for $\delta \ge \delta_2/\sqrt{s_0}$,

$$\|f_{(\hat\beta_{\rm init})_{\hat S^\delta_{\rm init}}} - f^0\|_n \le \|f_{S_0} - f^0\|_n + 3\sqrt{2}\,\Lambda_{\rm sparse}(s_0)\delta\sqrt{s_0},$$

and $\|(\hat\beta_{\rm init})_{\hat S^\delta_{\rm init}} - b^0\|_2 \le 3\delta\sqrt{s_0}$. The prediction and estimation results now follow from Corollary 8.4 combined with Condition B.



We apply Corollary 8.4 with

$$\delta^2 = \frac{\lambda_{\rm init}\lambda_{\rm adap}}{\phi_{\min}(6,S_0,2s_0)\Lambda_{\rm sparse}(s_0)}. \qquad (21)$$



Condition B requires that 

■Asparse('So) 



mm 



(6, 50,250) 



Ainit — Osuff(Aadap)- 



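This ensures that $\delta \ge \delta_2/\sqrt{s_0}$ on the set $\mathcal T$. Moreover, equation (21) gives that $\lambda_{\rm adap} \ge \delta$ as soon as

$$\lambda_{\rm adap} \ge \lambda_{\rm init}\,\frac{1}{\phi_{\min}(6,S_0,2s_0)\Lambda_{\rm sparse}(s_0)},$$

which is also ensured by Condition B.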

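The variable selection result follows from: for $\delta \ge \delta_2/\sqrt{s_0}$,

$$|\hat S_{\rm adap}\setminus S_0| \le |\hat S_{\rm adap}\cap(\hat S^\delta_{\rm init})^c\setminus S_0| + |\hat S^\delta_{\rm init}\setminus S_0| \le |\hat S_{\rm adap}\cap(\hat S^\delta_{\rm init})^c\setminus S_0| + s_0. \qquad (22)$$

□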

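8.3.5 Proof of Lemma 3.3, where coefficients are assumed to be large

On $\mathcal T$, for $j \in S_0$, $|\hat\beta_{j,{\rm init}}| > \delta_\infty$, and $|\hat\beta_{j,{\rm init}}| \ge |b_j^0|/2$, since $|b_j^0| > 2\delta_\infty$. Moreover, for $j \in S_0^c$, $|\hat\beta_{j,{\rm init}}| \le \delta_\infty$. Let

$$\|w_{S_0}\|_2^2/s_0 \le M^2.$$

Note that $M \le 1/\delta_\infty$. Since $w^{\min}_{S_0^c} \ge 1/\delta_\infty$, the condition $\lambda_{\rm adap}M \ge 1$ implies

$$\lambda_{\rm adap}w^{\min}_{S_0^c} \ge 1.$$

Apply Theorem 7.1 to the adaptive Lasso with $S = S_0$, and $\beta = b^0$: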



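$$\delta^2_{\rm adap} \le 2\|f_{S_0} - f^0\|_n^2 + \frac{14\lambda^2_{\rm init}\lambda^2_{\rm adap}M^2s_0}{\phi^2(6,S_0)} = O\left(\frac{\lambda^2_{\rm init}\lambda^2_{\rm adap}M^2s_0}{\phi^2(6,S_0)}\right),$$

$$\|\hat\beta_{\rm adap} - b^0\|_1 \le \frac{5\|f_{S_0} - f^0\|_n^2}{\lambda_{\rm init}\lambda_{\rm adap}M} + \frac{7\lambda_{\rm init}\lambda_{\rm adap}Ms_0}{\phi^2(6,S_0)} = O\left(\frac{\lambda_{\rm init}\lambda_{\rm adap}Ms_0}{\phi^2(6,S_0)}\right),$$

and

$$\|\hat\beta_{\rm adap} - b^0\|_2 \le \frac{10\|f_{S_0} - f^0\|_n^2}{M\sqrt{s_0}\,\lambda_{\rm init}\lambda_{\rm adap}} + \frac{28\lambda_{\rm init}\lambda_{\rm adap}M\sqrt{s_0}}{\phi^2(6,S_0,2s_0)} = O\left(\frac{\lambda_{\rm init}\lambda_{\rm adap}M\sqrt{s_0}}{\phi^2(6,S_0,2s_0)}\right).$$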

Also, when IS'adapX'S'ol > so, it holds that 

II f , _ f0||2 ||(i/w)a > c. Hi 

PadapX'^ol S ^^A gg(Soj 73 

-^adap^O \rnt 

'0||2 e2 



<^ Q0A2 ( „ \ ll/adap f IL ^2 
^ ^^^sparsel^oj 73 

'^adap'^O \nit 



□ 